
Build Advanced Multi-Endpoint Machine Learning APIs with LitServe: Batching, Streaming, Caching, and Local Inference

Tech · By Gavin Wallace · 25/10/2025 · 5 Mins Read

In this tutorial, we explore LitServe, an easy-to-use and powerful framework for serving machine learning models as APIs. We build and test several endpoints that demonstrate practical functionality, including text generation, batching, streaming, multi-task processing, and caching, all running locally without external APIs. By the end, we have a clear picture of how to build scalable, flexible, and efficient ML-serving pipelines for production applications. Check out the FULL CODES here.

!pip install litserve transformers torch -q


import litserve as ls
import torch
from transformers import pipeline
import time
from typing import List

We start by installing LitServe and PyTorch on Google Colab, then import the libraries and modules needed to define, test, and serve our APIs. Check out the FULL CODES here.

class TextGeneratorAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if device == "cuda" and torch.cuda.is_available() else -1)
        self.device = device
    def decode_request(self, request):
        return request["prompt"]
    def predict(self, prompt):
        result = self.model(prompt, max_length=100, num_return_sequences=1, temperature=0.8, do_sample=True)
        return result[0]["generated_text"]
    def encode_response(self, output):
        return {"generated_text": output, "model": "distilgpt2"}


class BatchedSentimentAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english", device=0 if device == "cuda" and torch.cuda.is_available() else -1)
    def decode_request(self, request):
        return request["text"]
    def batch(self, inputs: List[str]) -> List[str]:
        return inputs
    def predict(self, batch: List[str]):
        results = self.model(batch)
        return results
    def unbatch(self, output):
        return output
    def encode_response(self, output):
        return {"label": output["label"], "score": float(output["score"]), "batched": True}

Here we create two LitServe APIs: one for text generation using a local DistilGPT2 model, and one for batched sentiment analysis. Each API defines how it decodes incoming requests, runs inference, and returns structured responses. Check out the FULL CODES here.
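To see how LitServe's batching hooks fit together without loading a model, the round trip can be sketched with a stub predictor. The `FakeBatchedAPI` class and its keyword-based scoring are hypothetical stand-ins for illustration, not part of LitServe:

```python
from typing import List

class FakeBatchedAPI:
    """Toy stand-in for BatchedSentimentAPI: same hook order, no model."""
    def decode_request(self, request):
        return request["text"]
    def batch(self, inputs: List[str]) -> List[str]:
        # LitServe collects concurrent requests into one list before predict().
        return inputs
    def predict(self, batch: List[str]):
        # Stub inference: label by exclamation mark instead of a real model.
        return [{"label": "POSITIVE" if "!" in t else "NEGATIVE", "score": 0.99}
                for t in batch]
    def unbatch(self, outputs):
        # Split the batched result back into one item per request.
        return outputs
    def encode_response(self, output):
        return {"label": output["label"], "score": output["score"], "batched": True}

api = FakeBatchedAPI()
decoded = [api.decode_request({"text": t}) for t in ["Great!", "Meh."]]
responses = [api.encode_response(o)
             for o in api.unbatch(api.predict(api.batch(decoded)))]
print(responses[0]["label"], responses[1]["label"])  # POSITIVE NEGATIVE
```

The point is the hook order: decode each request, gather into a batch, run predict once, then unbatch and encode per request, which is exactly the path the real `BatchedSentimentAPI` takes.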

class StreamingTextAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("text-generation", model="distilgpt2", device=0 if device == "cuda" and torch.cuda.is_available() else -1)
    def decode_request(self, request):
        return request["prompt"]
    def predict(self, prompt):
        words = ["Once", "upon", "a", "time", "in", "a", "digital", "world"]
        for word in words:
            time.sleep(0.1)
            yield word + " "
    def encode_response(self, output):
        for token in output:
            yield {"token": token}

Here we define a streaming text-generation API that emits tokens in real time. By yielding one word at a time, it demonstrates how LitServe streams incremental output to the client. Check out the FULL CODES here.
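The predict/encode_response generator chain can be exercised directly in plain Python, with no model or server involved. The shortened word list below mirrors the simulated stream above:

```python
import time

def predict(prompt):
    # Yield one token at a time, as a streaming predict() would.
    for word in ["Once", "upon", "a", "time"]:
        time.sleep(0.01)
        yield word + " "

def encode_response(output):
    # Wrap each streamed token in a JSON-serializable dict.
    for token in output:
        yield {"token": token}

tokens = [chunk["token"] for chunk in encode_response(predict("ignored"))]
print("".join(tokens))  # "Once upon a time "
```

Because both hooks are generators, nothing is materialized up front: each token flows through `encode_response` the moment `predict` yields it.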

class MultiTaskAPI(ls.LitAPI):
    def setup(self, device):
        self.sentiment = pipeline("sentiment-analysis", device=-1)
        self.summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-6-6", device=-1)
        self.device = device
    def decode_request(self, request):
        return {"task": request.get("task", "sentiment"), "text": request["text"]}
    def predict(self, inputs):
        task = inputs["task"]
        text = inputs["text"]
        if task == "sentiment":
            result = self.sentiment(text)[0]
            return {"task": "sentiment", "result": result}
        elif task == "summarize":
            if len(text.split()) < 30:
                # Skip summarization for very short inputs
                return {"task": "summarize", "result": text}
            result = self.summarizer(text, max_length=50, min_length=10)[0]["summary_text"]
            return {"task": "summarize", "result": result}
        return {"task": task, "result": "unknown task"}
    def encode_response(self, output):
        return output

Next, we build a multi-task API that handles both sentiment analysis and summarization through a single endpoint. Each request is routed dynamically according to its task field, showing how to manage multiple pipelines behind one interface. Check out the FULL CODES here.
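The routing logic generalizes naturally to a dispatch table. This standalone sketch uses hypothetical stub handlers in place of the real pipelines, but the request shape and the default-task behavior match `MultiTaskAPI`:

```python
def run_sentiment(text):
    # Stub handler standing in for the sentiment pipeline.
    return {"task": "sentiment",
            "result": "POSITIVE" if "good" in text.lower() else "NEGATIVE"}

def run_summarize(text):
    # Stub handler: keep the first five words as the "summary".
    return {"task": "summarize", "result": " ".join(text.split()[:5])}

HANDLERS = {"sentiment": run_sentiment, "summarize": run_summarize}

def predict(request):
    task = request.get("task", "sentiment")  # default task, as in MultiTaskAPI
    handler = HANDLERS.get(task)
    if handler is None:
        return {"task": task, "result": "unknown task"}
    return handler(request["text"])

print(predict({"task": "summarize",
               "text": "LitServe routes each request to the right pipeline"}))
# {'task': 'summarize', 'result': 'LitServe routes each request to'}
```

Adding a new task then becomes a one-line change to `HANDLERS` rather than another `elif` branch.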

class CachedAPI(ls.LitAPI):
    def setup(self, device):
        self.model = pipeline("sentiment-analysis", device=-1)
        self.cache = {}
        self.hits = 0
        self.misses = 0
    def decode_request(self, request):
        return request["text"]
    def predict(self, text):
        if text in self.cache:
            self.hits += 1
            return self.cache[text], True
        self.misses += 1
        result = self.model(text)[0]
        self.cache[text] = result
        return result, False
    def encode_response(self, output):
        result, from_cache = output
        return {"label": result["label"], "score": float(result["score"]), "from_cache": from_cache, "cache_stats": {"hits": self.hits, "misses": self.misses}}

Here we add caching to our API: previous inference results are stored so that repeated requests skip redundant computation. Cache hits and misses are tracked in real time, showing how caching can dramatically improve performance for repeated inputs. Check out the FULL CODES here.
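Stripped of the model, the hit/miss bookkeeping can be verified in isolation. The `CachedPredictor` class below is a generic sketch; the length-counting function stands in for the expensive pipeline call:

```python
class CachedPredictor:
    def __init__(self, fn):
        self.fn = fn          # the expensive "model" call
        self.cache = {}
        self.hits = 0
        self.misses = 0

    def predict(self, x):
        if x in self.cache:
            self.hits += 1
            return self.cache[x], True    # (result, from_cache)
        self.misses += 1
        result = self.fn(x)
        self.cache[x] = result
        return result, False

p = CachedPredictor(lambda t: len(t))     # stand-in for pipeline(...)
for _ in range(3):
    result, from_cache = p.predict("LitServe is awesome!")
print(p.hits, p.misses)  # 2 1
```

The first request misses and pays for inference; the next two identical requests are served from the dictionary, which is exactly the pattern `CachedAPI.predict` implements. Note that a plain dict grows without bound; a production cache would add eviction (e.g. `functools.lru_cache` or an LRU dict).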

def test_apis_locally():
    print("=" * 70)
    print("Testing APIs Locally (No Server)")
    print("=" * 70)


    api1 = TextGeneratorAPI(); api1.setup("cpu")
    decoded = api1.decode_request({"prompt": "Artificial intelligence will"})
    result = api1.predict(decoded)
    encoded = api1.encode_response(result)
    print(f"✓ Result: {encoded['generated_text'][:100]}...")


    api2 = BatchedSentimentAPI(); api2.setup("cpu")
    texts = ["I love Python!", "This is terrible.", "Neutral statement."]
    decoded_batch = [api2.decode_request({"text": t}) for t in texts]
    batched = api2.batch(decoded_batch)
    results = api2.predict(batched)
    unbatched = api2.unbatch(results)
    for i, r in enumerate(unbatched):
        encoded = api2.encode_response(r)
        print(f"✓ '{texts[i]}' -> {encoded['label']} ({encoded['score']:.2f})")


    api3 = MultiTaskAPI(); api3.setup("cpu")
    decoded = api3.decode_request({"task": "sentiment", "text": "Amazing tutorial!"})
    result = api3.predict(decoded)
    print(f"✓ Sentiment: {result['result']}")


    api4 = CachedAPI(); api4.setup("cpu")
    test_text = "LitServe is awesome!"
    for i in range(3):
        decoded = api4.decode_request({"text": test_text})
        result = api4.predict(decoded)
        encoded = api4.encode_response(result)
        print(f"✓ Request {i+1}: {encoded['label']} (cached: {encoded['from_cache']})")


    print("=" * 70)
    print("✅ All tests completed successfully!")
    print("=" * 70)


test_apis_locally()

Finally, we test all our APIs locally, without launching a server, to verify their correctness. We exercise text generation, batched sentiment analysis, multi-task routing, and caching in sequence, confirming that each part of the LitServe pipeline works as expected.
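In production you would serve one of these APIs over HTTP rather than calling the hooks directly. A minimal launch sketch, assuming LitServe's standard `LitServer` entry point; the port and batching settings below are illustrative, not prescribed by this tutorial:

```python
# serve.py - run with: python serve.py
import litserve as ls

if __name__ == "__main__":
    api = BatchedSentimentAPI()
    # max_batch_size groups concurrent requests before predict();
    # batch_timeout caps how long the server waits to fill a batch.
    server = ls.LitServer(api, accelerator="auto",
                          max_batch_size=8, batch_timeout=0.05)
    server.run(port=8000)
```

Clients would then POST JSON such as `{"text": "I love Python!"}` to the server's prediction endpoint.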

In this tutorial, we demonstrated LitServe's flexibility by building and running diverse APIs: text generation, batched sentiment analysis, streaming, multi-task routing, and caching, all backed by local Hugging Face models. Having completed it, we see how LitServe can simplify model deployment workflows, letting us stand up intelligent ML services in just a few lines of Python while maintaining simplicity, flexibility, and performance.


Check out the FULL CODES here. Feel free to browse our GitHub Page for tutorials, codes, and notebooks. Also, follow us on Twitter, join our Facebook group and 100k+ ML SubReddit, subscribe to our Newsletter, and join us on Telegram.


Asif Razzaq is the CEO of Marktechpost Media Inc. An entrepreneur with a passion for harnessing Artificial Intelligence to benefit society, his latest venture is Marktechpost, a media platform focused on Artificial Intelligence. It is known for in-depth coverage of machine learning and deep learning news that is technically sound and accessible to audiences of all backgrounds. With over 2 million views per month, the platform's popularity speaks for itself.
