In this tutorial, we walk through building an AI agent that not only talks but can also remember. Starting from scratch, we show how to combine a lightweight LLM, FAISS vector search, and a summarization mechanism to give the agent both long-term and short-term memory. By auto-distilling facts and embedding them for retrieval, we create an intelligent agent that adapts to our instructions, retains important details across future conversations, and compresses context intelligently. Check out the FULL CODES here.
!pip -q install transformers accelerate bitsandbytes sentence-transformers faiss-cpu
import os, json, time, uuid, math, re
from datetime import datetime
import torch, faiss
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, BitsAndBytesConfig
from sentence_transformers import SentenceTransformer
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
We install the libraries and then import all the required modules. We also set a DEVICE flag so the code can detect whether the model can run on a GPU or must fall back to the CPU. Check out the FULL CODES here.
def load_llm(model_name="TinyLlama/TinyLlama-1.1B-Chat-v1.0"):
    try:
        if DEVICE == "cuda":
            bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16, bnb_4bit_quant_type="nf4")
            tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
            mdl = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb, device_map="auto")
        else:
            tok = AutoTokenizer.from_pretrained(model_name, use_fast=True)
            mdl = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32, low_cpu_mem_usage=True)
        return pipeline("text-generation", model=mdl, tokenizer=tok, device=0 if DEVICE == "cuda" else -1, do_sample=True)
    except Exception as e:
        raise RuntimeError(f"Failed to load LLM: {e}")
We define a function that loads our language model. If a GPU is available, it configures 4-bit quantization to optimize performance; otherwise it falls back to a standard load. Either way, we can generate text efficiently on whatever hardware we're running. Check out the FULL CODES here.
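As a back-of-the-envelope illustration of why the 4-bit path matters (numbers are our own rough arithmetic, not from the tutorial), TinyLlama's ~1.1B weights shrink roughly 8× versus full precision:

```python
# Rough memory footprint of TinyLlama's ~1.1B weights under different precisions.
# Weights only -- KV cache and activations add more on top of this.
PARAMS = 1.1e9

def weight_gb(bits_per_param):
    """Approximate weight storage in GiB at a given precision."""
    return PARAMS * bits_per_param / 8 / 1024**3

fp32 = weight_gb(32)  # full precision, the CPU fallback path
fp16 = weight_gb(16)  # half precision
nf4  = weight_gb(4)   # 4-bit NF4 quantization, the GPU path

print(f"fp32: {fp32:.2f} GiB, fp16: {fp16:.2f} GiB, nf4: {nf4:.2f} GiB")
# → fp32: 4.10 GiB, fp16: 2.05 GiB, nf4: 0.51 GiB
```

This is why the quantized model fits comfortably on a free Colab GPU.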
class VectorMemory:
    def __init__(self, path="/content/agent_memory.json", dim=384):
        self.path = path; self.dim = dim; self.items = []
        self.embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device=DEVICE)
        self.index = faiss.IndexFlatIP(dim)
        if os.path.exists(path):
            data = json.load(open(path))
            self.items = data.get("items", [])
            if self.items:
                X = torch.tensor([x["emb"] for x in self.items], dtype=torch.float32).numpy()
                self.index.add(X)

    def _emb(self, text):
        v = self.embedder.encode([text], normalize_embeddings=True)[0]
        return v.tolist()

    def add(self, text, meta=None):
        e = self._emb(text); self.index.add(torch.tensor([e]).numpy())
        rec = {"id": str(uuid.uuid4()), "text": text, "meta": meta or {}, "emb": e}
        self.items.append(rec); self._save()
        return rec["id"]

    def search(self, query, k=5, thresh=0.25):
        if len(self.items) == 0: return []
        q = self.embedder.encode([query], normalize_embeddings=True)
        D, I = self.index.search(q, min(k, len(self.items)))
        out = []
        for d, i in zip(D[0], I[0]):
            if i == -1: continue
            if d >= thresh: out.append((d, self.items[i]))
        return out

    def _save(self):
        slim = [{k: v for k, v in it.items()} for it in self.items]
        json.dump({"items": slim}, open(self.path, "w"), indent=2)
We create a VectorMemory class to give our agent long-term memory. We embed past interactions with MiniLM, then index and search them with FAISS, which lets the agent find and retrieve relevant information later. Every memory is persisted to disk so the agent keeps its memory across sessions. Check out the FULL CODES here.
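Because the embeddings are normalized and the index is `IndexFlatIP`, the inner product equals cosine similarity. A stdlib-only toy (the class and names here are hypothetical, not the tutorial's API) sketches the same add/search logic:

```python
import math

def normalize(v):
    """Scale a vector to unit length so dot product == cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

class TinyMemory:
    """Toy stand-in for VectorMemory: normalized vectors + inner-product search."""
    def __init__(self):
        self.items = []  # list of (embedding, text) pairs

    def add(self, emb, text):
        self.items.append((normalize(emb), text))

    def search(self, query_emb, k=2, thresh=0.25):
        q = normalize(query_emb)
        scored = [(sum(a * b for a, b in zip(q, e)), t) for e, t in self.items]
        scored.sort(reverse=True)  # highest similarity first
        return [(s, t) for s, t in scored[:k] if s >= thresh]

mem = TinyMemory()
mem.add([1.0, 0.0], "user prefers short answers")
mem.add([0.0, 1.0], "user is preparing for an exam")
hits = mem.search([0.9, 0.1], k=1)
print(hits)  # the "short answers" memory, with similarity close to 1.0
```

The `thresh` cutoff plays the same role as in `VectorMemory.search`: weak matches are dropped rather than injected into the prompt.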
def now_iso(): return datetime.now().isoformat(timespec="seconds")
def clamp(txt, n=1600): return txt if len(txt) <= n else txt[:n]

class MemoryAgent:
    # ... (__init__, _gen, _distill_and_store, and _chat_prompt are in the FULL CODES) ...
    def _maybe_summarize(self):
        if len(self.turns) > self.max_turns:
            convo = "\n".join([f"{r}: {t}" for r, t in self.turns])
            s = self._gen(SUMMARIZE_PROMPT(clamp(convo, 3500)), max_new_tokens=180, temp=0.2)
            self.summary = s; self.turns = self.turns[-4:]

    def recall(self, query, k=5):
        hits = self.mem.search(query, k=k)
        return "\n".join([f"- ({d:.2f}) {h['text']} [meta={h['meta']}]" for d, h in hits])

    def ask(self, user):
        self.turns.append(("user", user))
        saved, memline = self._distill_and_store(user)
        mem_ctx = self.recall(user, k=6)
        prompt = self._chat_prompt(user, mem_ctx)
        reply = self._gen(prompt)
        self.turns.append(("assistant", reply))
        self._maybe_summarize()
        status = f"💾 memory_saved: {saved}; " + (f"note: {memline}" if saved else "note: -")
        print(f"\nUSER: {user}\nASSISTANT: {reply}\n{status}")
        return reply
The MemoryAgent class brings it all together. The agent generates context-aware responses, stores important information in long-term memory, and summarizes the conversation to keep short-term context compact. With this setup, we create an assistant that remembers our conversations, recalls them when relevant, and adapts. Check out the FULL CODES here.
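The short-term side of this design can be sketched with a stdlib-only stand-in (function names and the `max_turns` default here are our assumptions; `summarize()` is a placeholder for the LLM call in `_gen`):

```python
def summarize(convo):
    # Placeholder "summarizer": first line plus a turn count.
    # In the real agent, this is an LLM call with SUMMARIZE_PROMPT.
    lines = convo.splitlines()
    return f"{lines[0]} … ({len(lines)} turns)"

def maybe_summarize(turns, summary, max_turns=8, keep=4):
    """Once the transcript exceeds max_turns, compress it into a running
    summary and keep only the most recent `keep` turns (mirrors turns[-4:])."""
    if len(turns) > max_turns:
        convo = "\n".join(f"{role}: {text}" for role, text in turns)
        summary = summarize(convo)
        turns = turns[-keep:]
    return turns, summary

turns = [("user", f"msg {i}") for i in range(10)]
turns, summary = maybe_summarize(turns, "")
print(len(turns), "|", summary)  # 4 recent turns kept, 10-turn history summarized
```

Older context is not lost: it survives as the compressed summary (and as distilled facts in the FAISS store), while the prompt stays short.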
agent=MemoryAgent()
print("✅ Agent ready. Try these:\n")
agent.ask("Hi! My name is Nicolaus, I prefer being called Nik. I'm preparing for UPSC in 2027.")
agent.ask("Also, I work at Visa in analytics and love concise answers.")
agent.ask("What's my exam year and how should you address me next time?")
agent.ask("Reminder: I like agentic RAG tutorials with single-file Colab code.")
agent.ask("Given my prefs, suggest a study focus for this week in one paragraph.")
We instantiate MemoryAgent and immediately send it a few messages to seed long-term memories and test recall. The agent picks up what we prefer to be called, our exam year, and our preference for concise answers, and we confirm that earlier preferences (agentic RAG, single-file Colab code) are remembered in later turns.
In conclusion, it's striking how much capability we gain simply by giving our AI agent the ability to remember. The agent now stores important information, retrieves it when relevant, and summarizes conversations to stay efficient. This keeps every interaction contextual and lets the agent evolve with each exchange. From this foundation, we can expand the memory store, experiment with richer schemas, or build more advanced memory-augmented agent designs.
Check out the FULL CODES here.
Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary entrepreneur, Asif is committed to harnessing Artificial Intelligence for social good. His most recent venture is Marktechpost, a platform for machine learning and deep learning news that is both technically sound and accessible to a broad audience, drawing over 2 million monthly views.

