In this tutorial, we explore LitServe, an easy-to-use and powerful framework for serving machine-learning models as APIs. We develop and test several endpoints that demonstrate practical functionality, including text generation, batching, streaming, multi-task processing, and caching, all running locally without external APIs. By the end, we have a clear understanding of how to build scalable, flexible, and efficient ML-serving pipelines for production applications.
!pip install litserve torch transformers -q


import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List
We begin by installing LitServe, PyTorch, and Transformers on Google Colab, then import the libraries and modules we need to define, test, and serve our APIs.
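Before defining any endpoints, it can help to confirm that the environment is ready. The following sanity check is our own addition, not part of the original notebook, and assumes each package exposes the usual __version__ attribute:

# Sanity check (our own addition): confirm installed versions and CUDA visibility
import torch
import transformers
import litserve as ls

print("litserve:", ls.__version__)
print("transformers:", transformers.__version__)
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())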
class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        # Load a local DistilGPT2 pipeline; fall back to CPU (-1) when CUDA is unavailable
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if device == "cuda" and torch.cuda.is_available() else -1)
        self.device = device

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]["generated_text"]

    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}
class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0 if device == "cuda" and torch.cuda.is_available() else -1)

    def decode_request(self, request):
        return request["text"]

    def batch(self, inputs: List[str]) -> List[str]:
        # Group individual requests into one list for a single forward pass
        return inputs

    def predict(self, batch: List[str]):
        results = self.model(batch)
        return results

    def unbatch(self, output):
        # Split the batched results back into per-request outputs
        return output

    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}
Here we create two LitServe APIs: one for text generation using a local DistilGPT2 model, and one for batched sentiment analysis. Each class defines how requests are decoded, how inference is performed, and how structured responses are returned.
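To actually serve either class over HTTP, you would wrap it in an ls.LitServer. The sketch below is our own illustration of the standard LitServe pattern, not part of the tutorial's local test flow; the port and batch settings are arbitrary values:

# Sketch: wiring the classes into an HTTP server (illustrative values)
if __name__ == "__main__":
    server = ls.LitServer(TextGeneratorAPI(), accelerator="auto")
    # For the batched API, max_batch_size/batch_timeout activate the batch()/unbatch() hooks:
    # server = ls.LitServer(BatchedSentimentAPI(), max_batch_size=8, batch_timeout=0.05)
    server.run(port=8000)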
class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if device == "cuda" and torch.cuda.is_available() else -1)

    def decode_request(self, request):
        return request["prompt"]

    def predict(self, prompt):
        # Simulate token-by-token generation by yielding one word at a time
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "

    def encode_response(self, output):
        for token in output:
            yield {"token": token}
Next, we define a streaming text-generation API that emits tokens in real time. By yielding one word at a time, it demonstrates how LitServe streams output continuously instead of returning a single response.
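To expose this behavior over HTTP, streaming must be enabled when constructing the server. The sketch below is our own illustration; the default /predict route and port 8000 are assumptions, and the client snippet expects a live server:

# Sketch: server side — stream=True tells LitServe to forward yielded chunks as produced
# server = ls.LitServer(StreamingTextAPI(), stream=True)
# server.run(port=8000)

# Sketch: client side — consumes the response incrementally (assumed /predict route)
import requests

with requests.post("http://127.0.0.1:8000/predict",
                   json={"prompt": "Once upon a time"}, stream=True) as resp:
    for chunk in resp.iter_lines():
        if chunk:
            print(chunk.decode(), flush=True)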
class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device

    def decode_request(self, request):
        # Default to sentiment analysis when no task is specified
        return {"task": request.get("task", "sentiment"), "text": request["text"]}

    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            # Reconstructed guard: very short inputs are returned as-is (threshold assumed)
            if len(text.split()) < 30:
                return {"task": "summarize", "result": {"summary_text": text}}
            result = self.summarizer(text, max_length=50, min_length=10)[0]
            return {"task": "summarize", "result": result}
        return {"task": task, "error": "unsupported task"}

    def encode_response(self, output):
        return output
Now we develop a multi-task API that handles both sentiment analysis and summarization through a single endpoint. It shows how to manage multiple pipelines behind one interface, with each request routed dynamically according to its task field.
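Exercising this routing locally only requires changing the task field in the request. The following is a small sketch of our own, with example texts that are not from the tutorial:

# Sketch: routing one endpoint to different tasks via the "task" field
api = MultiTaskAPI(); api.setup("cpu")

for payload in [
    {"task": "sentiment", "text": "LitServe makes serving painless."},
    {"task": "summarize", "text": "LitServe is a serving framework for ML models. " * 8},
]:
    out = api.predict(api.decode_request(payload))
    print(out["task"], "->", out["result"])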
class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def decode_request(self, request):
        return request["text"]

    def predict(self, text):
        # Return the stored result on a cache hit, skipping inference entirely
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False

    def encode_response(self, output):
        result, from_cache = output
        return {"label": result["label"], "score": float(result["score"]), "from_cache": from_cache, "cache_stats": {"hits": self.hits, "misses": self.misses}}
We then add caching to our API: previous inference results are stored so that repeated requests skip redundant computation. Cache hits and misses are tracked in real time, showing how caching can dramatically improve performance.
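Note that the dictionary cache above grows without bound and keys on the raw input string. In production you might cap its size; a minimal LRU variant, our own sketch rather than part of the tutorial, could look like this:

# Sketch: bounding the cache with an LRU policy using OrderedDict
from collections import OrderedDict

class LRUCache:
    def __init__(self, capacity: int = 1024):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)   # mark as recently used
        return self.data[key]

    def put(self, key, value):
        self.data[key] = value
        self.data.move_to_end(key)
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # evict the least recently used entry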
def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)

    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")

    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")

    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")

    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")

    print("=" * 70)
    print("✅ All tests completed successfully!")
    print("=" * 70)


test_apis_locally()
We test all of our APIs locally, without launching a server, to verify their correctness. We sequentially exercise text generation, batched sentiment analysis, multi-task routing, and caching, confirming that each part of the LitServe workflow runs smoothly.
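To see the caching payoff quantitatively, you can time repeated calls against the same input. This timing snippet is our own sketch, not part of the tutorial's test suite:

# Sketch: timing a cache miss vs. a cache hit on CachedAPI
import time

api = CachedAPI(); api.setup("cpu")
text = "LitServe is awesome!"

for label in ("miss", "hit"):
    start = time.perf_counter()
    api.predict(api.decode_request({"text": text}))
    print(f"{label}: {time.perf_counter() - start:.4f}s")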
In conclusion, we demonstrate LitServe's flexibility by building and running these diverse APIs: text generation, batched sentiment analysis, streaming, multi-task processing, and caching, all backed by local Hugging Face models. Having completed the tutorial, we see how LitServe simplifies model-deployment workflows, letting us serve intelligent ML systems in just a few lines of Python while maintaining simplicity, flexibility, and performance.
Asif Razzaq serves as the CEO at Marktechpost Media Inc. As an entrepreneur, Asif has a passion for harnessing Artificial Intelligence to benefit society. Marktechpost is his latest venture, a media platform that focuses on Artificial Intelligence. It is known for providing in-depth news coverage about machine learning, deep learning, and other topics. The content is technically accurate and easy to understand by an audience of all backgrounds. Over 2 million views per month are a testament to the platform’s popularity.

