Google AI releases LangExtract, an open source Python library that extracts structured data from unstructured text documents

Tech | By Gavin Wallace | 05/08/2025 | 4 Mins Read

In today’s data-driven world, valuable insights are often buried in unstructured text, whether clinical notes, lengthy legal contracts, or customer feedback threads. Extracting useful, traceable data from such documents is both a technical and a practical challenge. LangExtract, Google AI’s open-source Python library, aims to close this gap by using LLMs such as Gemini for automated extraction, with traceability and transparency at its core.

1. Declarative and Traceable Extraction

LangExtract lets users define custom extraction tasks using natural language instructions and high-quality “few-shot” examples. Developers specify exactly which entities, relationships, or facts they want to extract, and in what structure. Importantly, every piece of extracted information is linked directly to its location in the original text, which enables validation, auditing, and end-to-end traceability.

2. Domain Versatility

The library is aimed not just at tech demos but at critical real-world domains, including health (clinical notes, medical reports), finance (summaries, risk documents), law (contracts), research literature, and even the arts (analyzing Shakespeare). Initial use cases include automatic extraction of medications, dosages, and administration details from clinical documents, as well as relationships and emotions from plays or literature.

3. Schema Enforcement with LLMs

Powered by Gemini and compatible with other LLMs, LangExtract enforces custom output schemas (such as JSON), so results are not just accurate but immediately usable in downstream databases, analytics, or AI pipelines. It addresses traditional LLM weaknesses around hallucination and schema drift by grounding outputs in both the user’s instructions and the actual source text.
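To make the idea of grounding concrete, here is a minimal sketch in plain Python. It is not LangExtract’s internal implementation: the `Candidate` class, the `ALLOWED_CLASSES` set, and the `ground` function are hypothetical names introduced for illustration. The sketch assumes a simple rule: keep a candidate only if its class belongs to the schema and its text appears verbatim in the source, and record the character offsets so that each result can be traced back.

```python
from dataclasses import dataclass, field

# Hypothetical schema: the only extraction classes we accept.
ALLOWED_CLASSES = {"character", "emotion", "relationship"}


@dataclass
class Candidate:
    """Hypothetical stand-in for one model-produced extraction."""
    extraction_class: str
    extraction_text: str
    attributes: dict = field(default_factory=dict)


def ground(source: str, candidates: list[Candidate]) -> tuple[list[dict], list[Candidate]]:
    """Split candidates into grounded records (with offsets) and rejects."""
    grounded, rejected = [], []
    for c in candidates:
        start = source.find(c.extraction_text)
        in_schema = c.extraction_class in ALLOWED_CLASSES
        # Reject empty text, out-of-schema classes, and text absent from the source.
        if not c.extraction_text or not in_schema or start < 0:
            rejected.append(c)
            continue
        grounded.append({
            "class": c.extraction_class,
            "text": c.extraction_text,
            "start": start,
            "end": start + len(c.extraction_text),
            "attributes": c.attributes,
        })
    return grounded, rejected


source = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"
candidates = [
    Candidate("character", "Lady Juliet", {"emotional_state": "longing"}),
    Candidate("emotion", "her heart aching", {"feeling": "yearning"}),
    Candidate("character", "Mercutio"),   # not present in the source text
    Candidate("location", "the stars"),   # class is outside the schema
]
grounded, rejected = ground(source, candidates)
for record in grounded:
    print(record["class"], repr(source[record["start"]:record["end"]]))
```

Exact string matching is the simplest possible grounding rule; a production system would likely need some tolerance for whitespace or casing differences, which this sketch deliberately leaves out.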

4. Scalability and Visualization

  • Handles Large Volumes: LangExtract efficiently processes long documents by chunking, parallelizing, and aggregating results.
  • Interactive Visualization: Developers can generate interactive HTML reports, viewing each extracted entity with context by highlighting its location in the original document—making auditing and error analysis seamless.
  • Smooth Integration: Works in Google Colab, Jupyter, or as standalone HTML files, supporting a rapid feedback loop for developers and researchers.
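The chunk, parallelize, and aggregate pattern mentioned above can be sketched in a few lines of standard-library Python. This illustrates the general technique only and makes no claim about how LangExtract implements it: `chunk_text`, `find_keyword`, and `extract_long` are hypothetical helpers, and a simple keyword search stands in for the model call.

```python
from concurrent.futures import ThreadPoolExecutor


def chunk_text(text: str, size: int, overlap: int) -> list[tuple[int, str]]:
    """Return (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + size]))
        if start + size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def find_keyword(chunk: str, keyword: str) -> list[tuple[int, int]]:
    """Hypothetical stand-in for a model call: spans of `keyword` inside one chunk."""
    spans = []
    pos = chunk.find(keyword)
    while pos >= 0:
        spans.append((pos, pos + len(keyword)))
        pos = chunk.find(keyword, pos + 1)
    return spans


def extract_long(text: str, keyword: str, size: int = 40,
                 overlap: int = 10, workers: int = 4) -> list[tuple[int, int]]:
    """Chunk the text, search chunks in parallel, and merge spans in document offsets."""
    chunks = chunk_text(text, size, overlap)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_chunk = list(pool.map(lambda c: find_keyword(c[1], keyword), chunks))
    # Shift chunk-local spans to document offsets; the set removes duplicates
    # that were found twice inside an overlap region.
    merged = {
        (offset + start, offset + end)
        for (offset, _), spans in zip(chunks, per_chunk)
        for start, end in spans
    }
    return sorted(merged)


text = "Romeo loves Juliet. " * 5
for start, end in extract_long(text, "Juliet"):
    print(start, end, text[start:end])
```

The overlap matters: a match that straddles a chunk boundary is only found if the overlap is at least one character shorter than the match itself, which is why overlapping windows are a common choice over hard splits.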

5. Installation and Usage

Install with pip:

pip install langextract

Example Workflow (Extracting Character Info from Shakespeare):

import langextract as lx
import textwrap

# 1. Define your prompt
prompt = textwrap.dedent("""
Extract characters, emotions, and relationships in order of appearance.
Use exact text for extractions. Do not paraphrase or overlap entities.
Provide meaningful attributes for each entity to add context.
""")

# 2. Give a high-quality example
examples = [
    lx.data.ExampleData(
        text="ROMEO. But soft! What light through yonder window breaks? It is the east, and Juliet is the sun.",
        extractions=[
            lx.data.Extraction(extraction_class="character", extraction_text="ROMEO", attributes={"emotional_state": "wonder"}),
            lx.data.Extraction(extraction_class="emotion", extraction_text="But soft!", attributes={"feeling": "gentle awe"}),
            lx.data.Extraction(extraction_class="relationship", extraction_text="Juliet is the sun", attributes={"type": "metaphor"}),
        ],
    )
]

# 3. Extract from new text
input_text = "Lady Juliet gazed longingly at the stars, her heart aching for Romeo"

result = lx.extract(
    text_or_documents=input_text,
    prompt_description=prompt,
    examples=examples,
    model_id="gemini-2.5-pro"
)

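# 4. Save and visualize results
# Write the JSONL file to the current directory so visualize() can find it
lx.io.save_annotated_documents([result], output_name="extraction_results.jsonl", output_dir=".")
html_content = lx.visualize("extraction_results.jsonl")
with open("visualization.html", "w", encoding="utf-8") as f:
    # In notebooks the result may be an HTML display object rather than a string
    f.write(html_content.data if hasattr(html_content, "data") else html_content)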

This results in structured, source-anchored JSON outputs, plus an interactive HTML visualization for easy review and demonstration.
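Once extractions are available, a common next step is to group them by class before loading them into a database or report. The sketch below assumes each extraction exposes the `extraction_class`, `extraction_text`, and `attributes` fields used when constructing the few-shot example; that the result object offers them as an iterable (for instance `result.extractions`) is an assumption here, so stand-in objects are used to keep the snippet runnable without an API key. The `group_by_class` helper is a hypothetical name.

```python
from collections import defaultdict
from types import SimpleNamespace


def group_by_class(extractions):
    """Map each extraction_class to a list of (extraction_text, attributes) pairs."""
    grouped = defaultdict(list)
    for ex in extractions:
        # Treat missing attributes as an empty dict so downstream code can rely on it.
        grouped[ex.extraction_class].append((ex.extraction_text, ex.attributes or {}))
    return dict(grouped)


# Stand-in objects with the same field names as the example extractions above.
sample = [
    SimpleNamespace(extraction_class="character", extraction_text="Lady Juliet",
                    attributes={"emotional_state": "longing"}),
    SimpleNamespace(extraction_class="emotion", extraction_text="her heart aching",
                    attributes=None),
]

grouped = group_by_class(sample)
for cls, items in grouped.items():
    for text, attrs in items:
        print(f"{cls}: {text!r} {attrs}")
```

With a real result, the call would presumably look like `group_by_class(result.extractions)`, subject to the assumption stated above.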

Specialized & Real-World Applications

  • Medicine: Extracts medications, dosages, timing, and links them back to source sentences. Powered by insights from research conducted on accelerating medical information extraction, LangExtract’s approach is directly applicable to structuring clinical and radiology reports—improving clarity and supporting interoperability.
  • Finance & Law: Automatically pulls relevant clauses, terms, or risks from dense legal or financial text, ensuring every output can be traced back to its context.
  • Research & Data Mining: Streamlines high-throughput extraction from thousands of scientific papers.

The team even provides a demonstration called RadExtract for structuring radiology reports—highlighting not just what was extracted, but exactly where the information appeared in the original input.

How LangExtract Compares

| Feature | Traditional Approaches | LangExtract Approach |
|---|---|---|
| Schema Consistency | Often manual/error-prone | Enforced via instructions & few-shot examples |
| Result Traceability | Minimal | All output linked to input text |
| Scaling to Long Texts | Windowed, lossy | Chunked + parallel extraction, then aggregation |
| Visualization | Custom, usually absent | Built-in, interactive HTML reports |
| Deployment | Rigid, model-specific | Gemini-first, open to other LLMs & on-premises |

In Summary

LangExtract offers a new approach to extracting structured, actionable data from text, delivering:

  • Declarative, explainable extraction
  • Traceable results backed by source context
  • Instant visualization for rapid iteration
  • Easy integration into any Python workflow

Check out the GitHub Page and Technical Blog for tutorials, code, and notebooks.


Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur, he is passionate about harnessing the potential of artificial intelligence to benefit society. His most recent venture is Marktechpost, an artificial intelligence media platform known for its comprehensive coverage of machine learning and deep learning. The platform has over 2 million monthly views.
