In this tutorial, we learn how to leverage an advanced AI agent's Python execution and result-validation capabilities to handle complex computational tasks. We build an end-to-end solution by integrating LangChain's ReAct agent with Anthropic's Claude API: the agent writes Python code, executes it, captures its outputs, and maintains execution state, while the results are automatically verified against test cases or expected properties. This seamless "write → run → validate" loop lets you build robust analyses, simple ML pipelines, and algorithms with confidence at every stage.
pip install langchain langchain-anthropic langchain-core anthropic
We install the core LangChain framework along with the Anthropic integration and its core utilities, ensuring you have both the agent orchestration tools (langchain, langchain-core) and the Claude-specific bindings (langchain-anthropic, anthropic) available in your environment.
import os
from langchain.agents import create_react_agent, AgentExecutor
from langchain.tools import Tool
from langchain_core.prompts import PromptTemplate
from langchain_anthropic import ChatAnthropic
import sys
import io
import re
import json
from typing import Dict, Any, List
Alongside the core LangChain agent classes, we import ChatAnthropic from langchain_anthropic to connect to Claude. The standard Python modules (sys, io, re, json) handle I/O capture, regular expressions, and serialization, while the typing hints (Dict, Any, List) keep the code easier to maintain.
class PythonREPLTool:
    def __init__(self):
        self.globals_dict = {
            '__builtins__': __builtins__,
            'json': json,
            're': re
        }
        self.locals_dict = {}
        self.execution_history = []

    def run(self, code: str) -> str:
        try:
            old_stdout = sys.stdout
            old_stderr = sys.stderr
            sys.stdout = captured_output = io.StringIO()
            sys.stderr = captured_error = io.StringIO()

            execution_result = None
            try:
                # Try evaluating as an expression first so we can capture a return value
                result = eval(code, self.globals_dict, self.locals_dict)
                execution_result = result
                if result is not None:
                    print(result)
            except SyntaxError:
                # Fall back to exec for statements (assignments, defs, loops, ...)
                exec(code, self.globals_dict, self.locals_dict)

            output = captured_output.getvalue()
            error_output = captured_error.getvalue()
            sys.stdout = old_stdout
            sys.stderr = old_stderr

            self.execution_history.append({
                'code': code,
                'output': output,
                'result': execution_result,
                'error': error_output
            })

            response = f"**Code Executed:**\n```python\n{code}\n```\n\n"
            if error_output:
                response += f"**Errors/Warnings:**\n{error_output}\n\n"
            response += f"**Output:**\n{output if output.strip() else 'No console output'}"
            if execution_result is not None and not output.strip():
                response += f"\n**Return Value:** {execution_result}"
            return response

        except Exception as e:
            sys.stdout = old_stdout
            sys.stderr = old_stderr
            error_info = f"**Code Executed:**\n```python\n{code}\n```\n\n**Runtime Error:**\n{str(e)}\n**Error Type:** {type(e).__name__}"
            self.execution_history.append({
                'code': code,
                'output': '',
                'result': None,
                'error': str(e)
            })
            return error_info

    def get_execution_history(self) -> List[Dict[str, Any]]:
        return self.execution_history

    def clear_history(self):
        self.execution_history = []
This PythonREPLTool encapsulates a stateful, in-process Python REPL: it executes arbitrary code (evaluating expressions or running statements), redirects stdout/stderr to record outputs and errors, and appends each execution to a history. It returns a formatted summary of every snippet, including console output, return values, and any errors.
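To see the eval-then-exec pattern in isolation, here is a stripped-down sketch (a hypothetical helper, not the class above) showing how expressions are evaluated for their value while statements fall back to exec, with stdout captured in both cases and state persisting in a shared environment dict:

```python
import io
import sys

def run_snippet(code, env):
    """Evaluate `code` as an expression if possible, else exec it as statements.
    Returns (captured_stdout, result_or_None). `env` persists state across calls."""
    old_stdout = sys.stdout
    sys.stdout = captured = io.StringIO()
    result = None
    try:
        try:
            # Expressions ("2 + 2") produce a value we can report back.
            result = eval(code, env)
        except SyntaxError:
            # Statements ("x = 21") have no value; exec them for their side effects.
            exec(code, env)
    finally:
        sys.stdout = old_stdout
    return captured.getvalue(), result

env = {}
out, res = run_snippet("x = 21", env)   # statement: defines x in env
out2, res2 = run_snippet("x * 2", env)  # expression: uses the persisted x
print(res2)  # 42
```

Because the same `env` dict is reused, a variable defined in one call is visible in the next, which is exactly the statefulness the agent relies on across tool invocations.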
class ResultValidator:
    def __init__(self, python_repl: PythonREPLTool):
        self.python_repl = python_repl

    def validate_mathematical_result(self, description: str, expected_properties: Dict[str, Any]) -> str:
        """Validate mathematical computations"""
        validation_code = f"""
# Validation for: {description}
validation_results = {{}}

# Inspect the results of the last execution
history = {self.python_repl.execution_history}
if history:
    last_execution = history[-1]
    print(f"Last execution output: {{last_execution['output']}}")

    # Extract numbers from the output
    import re
    numbers = re.findall(r'\\d+(?:\\.\\d+)?', last_execution['output'])
    if numbers:
        numbers = [float(n) for n in numbers]
        validation_results['extracted_numbers'] = numbers

        # Validate expected properties
        for prop, expected_value in {expected_properties}.items():
            if prop == 'count':
                actual_count = len(numbers)
                validation_results['count_check'] = actual_count == expected_value
                print(f"Count validation: Expected {{expected_value}}, Got {{actual_count}}")
            elif prop == 'max_value':
                max_val = max(numbers)
                validation_results['max_check'] = max_val <= expected_value
                print(f"Max value validation: {{max_val}} <= {{expected_value}} = {{max_val <= expected_value}}")
            elif prop == 'min_value':
                min_val = min(numbers)
                validation_results['min_check'] = min_val >= expected_value
                print(f"Min value validation: {{min_val}} >= {{expected_value}} = {{min_val >= expected_value}}")
            elif prop == 'sum_range':
                total = sum(numbers)
                min_sum, max_sum = expected_value
                validation_results['sum_check'] = min_sum <= total <= max_sum
                print(f"Sum validation: {{min_sum}} <= {{total}} <= {{max_sum}} = {{min_sum <= total <= max_sum}}")

validation_results
"""
        return self.python_repl.run(validation_code)

    def validate_data_analysis(self, description: str, expected_structure: Dict[str, str]) -> str:
        """Validate data analysis results"""
        validation_code = f"""
# Data Analysis Validation for: {description}
validation_results = {{}}

# Verify that the required variables exist in the global scope
required_vars = {list(expected_structure.keys())}
existing_vars = []

for var_name in required_vars:
    if var_name in globals():
        existing_vars.append(var_name)
        var_value = globals()[var_name]
        validation_results[f'{{var_name}}_exists'] = True
        validation_results[f'{{var_name}}_type'] = type(var_value).__name__

        # Type-specific validations
        if isinstance(var_value, (list, tuple)):
            validation_results[f'{{var_name}}_length'] = len(var_value)
        elif isinstance(var_value, dict):
            validation_results[f'{{var_name}}_keys'] = list(var_value.keys())
        elif isinstance(var_value, (int, float)):
            validation_results[f'{{var_name}}_value'] = var_value
        print(f"✓ Variable '{{var_name}}' found: {{type(var_value).__name__}} = {{var_value}}")
    else:
        validation_results[f'{{var_name}}_exists'] = False
        print(f"✗ Variable '{{var_name}}' not found")

print(f"\\nFound {{len(existing_vars)}}/{{len(required_vars)}} required variables")

# Structure validation
for var_name, expected_type in {expected_structure}.items():
    if var_name in globals():
        actual_type = type(globals()[var_name]).__name__
        validation_results[f'{{var_name}}_type_match'] = actual_type == expected_type
        print(f"Type check '{{var_name}}': Expected {{expected_type}}, Got {{actual_type}}")

validation_results
"""
        return self.python_repl.run(validation_code)
    def validate_algorithm_correctness(self, description: str, test_cases: List[Dict[str, Any]]) -> str:
        """Validate algorithm implementations with test cases"""
        validation_code = f"""
# Algorithm Validation for: {description}
validation_results = {{}}
test_results = []

test_cases = {test_cases}

for i, test_case in enumerate(test_cases):
    test_name = test_case.get('name', f'Test {{i+1}}')
    input_val = test_case.get('input')
    expected = test_case.get('expected')
    function_name = test_case.get('function')

    print(f"\\nRunning {{test_name}}:")
    print(f"Input: {{input_val}}")
    print(f"Expected: {{expected}}")

    try:
        if function_name and function_name in globals():
            func = globals()[function_name]
            if callable(func):
                if isinstance(input_val, (list, tuple)):
                    result = func(*input_val)
                else:
                    result = func(input_val)

                passed = result == expected
                test_results.append({{
                    'test_name': test_name,
                    'input': input_val,
                    'expected': expected,
                    'actual': result,
                    'passed': passed
                }})

                status = "✓ PASS" if passed else "✗ FAIL"
                print(f"Actual: {{result}}")
                print(f"Status: {{status}}")
            else:
                print(f"✗ ERROR: '{{function_name}}' is not callable")
        else:
            print(f"✗ ERROR: Function '{{function_name}}' not found")
    except Exception as e:
        print(f"✗ ERROR: {{str(e)}}")
        test_results.append({{
            'test_name': test_name,
            'error': str(e),
            'passed': False
        }})

# Summary
passed_tests = sum(1 for test in test_results if test.get('passed', False))
total_tests = len(test_results)
validation_results['tests_passed'] = passed_tests
validation_results['total_tests'] = total_tests
validation_results['success_rate'] = passed_tests / total_tests if total_tests > 0 else 0

print(f"\\n=== VALIDATION SUMMARY ===")
print(f"Tests passed: {{passed_tests}}/{{total_tests}}")
print(f"Success rate: {{validation_results['success_rate']:.1%}}")

test_results
"""
        return self.python_repl.run(validation_code)
The ResultValidator builds on the PythonREPLTool class to generate and execute bespoke validation routines: checking numeric properties, validating data structures, or running algorithm test cases against the agent's execution history. It closes the loop by emitting Python code that compares outputs to the expected criteria and summarizes pass/fail outcomes, completing our agents' "execute → validate" workflow.
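The core trick in validate_algorithm_correctness — resolving a function by name at validation time and comparing its output against expected values — can be seen in a self-contained sketch (the `fib` function and its test cases here are hypothetical stand-ins for illustration):

```python
def fib(n):
    """Iterative Fibonacci: the implementation under test."""
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

test_cases = [
    {'function': 'fib', 'input': 0, 'expected': 0},
    {'function': 'fib', 'input': 10, 'expected': 55},
]

results = []
for case in test_cases:
    func = globals().get(case['function'])  # resolve by name, as the validator does
    if callable(func):
        actual = func(case['input'])
        results.append(actual == case['expected'])

passed = sum(results)
print(f"{passed}/{len(results)} tests passed")  # 2/2 tests passed
```

Looking functions up through `globals()` is what lets the validator test code the agent defined in earlier REPL executions without knowing the function names in advance.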
python_repl = PythonREPLTool()
validator = ResultValidator(python_repl)
We instantiate the interactive Python REPL (python_repl) and create a ResultValidator tied to that same REPL instance. Any code the agent runs is instantly available to the automated validation steps, closing the loop between execution and correctness checks.
python_tool = Tool(
    name="python_repl",
    description="Execute Python code and return both the code and its output. Maintains state between executions.",
    func=python_repl.run
)

validation_tool = Tool(
    name="result_validator",
    description="Validate the results of previous computations with specific test cases and expected properties.",
    func=lambda query: validator.validate_mathematical_result(query, {})
)
The LangChain Tool wrappers give our REPL and validation methods clear names and descriptions so the agent can invoke them: python_repl for code execution and result_validator for checking the results of the last execution.
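Conceptually, a Tool is just a named, documented callable: the agent picks one by name and passes it a string. A minimal, LangChain-independent sketch of that dispatch (the `fake_repl` and `fake_validator` functions are hypothetical stand-ins):

```python
# Hypothetical stand-ins for python_repl.run and the validator lambda.
def fake_repl(code: str) -> str:
    return f"executed: {code}"

def fake_validator(query: str) -> str:
    return f"validated: {query}"

# Each tool pairs a description (read by the LLM) with a callable.
tools = {
    "python_repl": {"description": "Execute Python code.", "func": fake_repl},
    "result_validator": {"description": "Validate previous results.", "func": fake_validator},
}

def dispatch(action: str, action_input: str) -> str:
    """Route an agent's chosen Action to the matching tool function."""
    tool = tools.get(action)
    if tool is None:
        return f"Unknown tool: {action}"
    return tool["func"](action_input)

print(dispatch("python_repl", "2 + 2"))  # executed: 2 + 2
```

AgentExecutor performs essentially this routing for us, plus retries, parsing-error handling, and iteration limits.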
prompt_template = """You are Claude, an AI assistant that can execute Python code and validate results.

You can use Python to solve complex problems, and you should verify your results for accuracy.

Available tools:
{tools}

Use the following format:

Question: the input question you must answer
Thought: think about what should be done
Action: the action to take, one of [{tool_names}]
Action Input: [your input]
Observation: [result]
... (repeat Thought/Action/Action Input/Observation as needed)
Thought: I should validate my results
Action: [validation if needed]
Action Input: [validation parameters]
Observation: [validation results]
Thought: I now know the final answer
Final Answer: [comprehensive answer with validation confirmation]

Question: {input}
{agent_scratchpad}"""
prompt = PromptTemplate(
    template=prompt_template,
    input_variables=["input", "agent_scratchpad"],
    partial_variables={
        "tools": "python_repl - Execute Python code\nresult_validator - Validate computation results",
        "tool_names": "python_repl, result_validator"
    }
)
The template above frames Claude as a dual-capability assistant that reasons ("Thought"), acts through tools ("Action"/"Action Input"), observes results, and finishes with a "Final Answer." Placeholders inject the tool names and descriptions, and the scaffolding enforces a disciplined "write → run → validate" workflow.
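Behind the scenes, the executor parses the model's text for `Action:`/`Action Input:` pairs before dispatching to a tool. A minimal, hypothetical parser for that format (LangChain's real parser is more robust, but the idea is the same):

```python
import re

def parse_react_step(text: str):
    """Extract (action, action_input) from one ReAct-formatted step, or None."""
    match = re.search(r"Action:\s*(.+?)\s*\nAction Input:\s*(.+)", text)
    if not match:
        return None
    return match.group(1).strip(), match.group(2).strip()

step = """Thought: I should compute the sum of primes.
Action: python_repl
Action Input: sum(p for p in range(2, 50) if all(p % d for d in range(2, p)))"""

print(parse_react_step(step))
```

If no `Action:` pair is found, the executor treats the text as a final answer or a parsing error; `handle_parsing_errors=True` (set below) tells the executor to recover rather than crash in that case.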
class AdvancedClaudeCodeAgent:
    def __init__(self, anthropic_api_key=None):
        if anthropic_api_key:
            os.environ["ANTHROPIC_API_KEY"] = anthropic_api_key

        self.llm = ChatAnthropic(
            model="claude-3-opus-20240229",
            temperature=0,
            max_tokens=4000
        )
        self.agent = create_react_agent(
            llm=self.llm,
            tools=[python_tool, validation_tool],
            prompt=prompt
        )
        self.agent_executor = AgentExecutor(
            agent=self.agent,
            tools=[python_tool, validation_tool],
            verbose=True,
            handle_parsing_errors=True,
            max_iterations=8,
            return_intermediate_steps=True
        )
        self.python_repl = python_repl
        self.validator = validator

    def run(self, query: str) -> str:
        try:
            result = self.agent_executor.invoke({"input": query})
            return result["output"]
        except Exception as e:
            return f"Error: {str(e)}"

    def validate_last_result(self, description: str, validation_params: Dict[str, Any]) -> str:
        """Manually validate the last computation result"""
        if 'test_cases' in validation_params:
            return self.validator.validate_algorithm_correctness(description, validation_params['test_cases'])
        elif 'expected_structure' in validation_params:
            return self.validator.validate_data_analysis(description, validation_params['expected_structure'])
        else:
            return self.validator.validate_mathematical_result(description, validation_params)

    def get_execution_summary(self) -> Dict[str, Any]:
        """Get a summary of all executions"""
        history = self.python_repl.get_execution_history()
        return {
            'total_executions': len(history),
            'successful_executions': len([h for h in history if not h['error']]),
            'failed_executions': len([h for h in history if h['error']]),
            'execution_details': history
        }
This AdvancedClaudeCodeAgent class wraps everything into a single, easy-to-use interface: it configures the Anthropic Claude client (using your API key), instantiates a ReAct-style agent with our python_repl and result_validator tools and the custom prompt, and sets up an executor that drives iterative "think → code → validate" loops. The run() method lets you pose natural-language questions and receive Claude's final, self-checked answer; validate_last_result() exposes manual hooks for additional checks; and get_execution_summary() reports how many code snippets executed successfully or failed, along with their details.
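The summary logic is a simple fold over the history list; run against a mock history (hypothetical entries, in the same shape PythonREPLTool records) it behaves like this:

```python
# Mock execution history in the same shape PythonREPLTool records.
history = [
    {'code': 'x = 1', 'output': '', 'result': None, 'error': ''},
    {'code': '1 / 0', 'output': '', 'result': None, 'error': 'division by zero'},
    {'code': 'print(x)', 'output': '1\n', 'result': None, 'error': ''},
]

# Same counting logic as get_execution_summary(): an entry with a
# non-empty 'error' field counts as a failed execution.
summary = {
    'total_executions': len(history),
    'successful_executions': len([h for h in history if not h['error']]),
    'failed_executions': len([h for h in history if h['error']]),
}

print(summary)  # {'total_executions': 3, 'successful_executions': 2, 'failed_executions': 1}
```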
if __name__ == "__main__":
    API_KEY = "Use Your Own Key Here"

    agent = AdvancedClaudeCodeAgent(anthropic_api_key=API_KEY)

    print("🚀 Advanced Claude Code Agent with Validation")
    print("=" * 60)

    print("\n🔢 Example 1: Prime Number Analysis with Twin Prime Detection")
    print("-" * 60)
    query1 = """
    Find all prime numbers between 1 and 200, and then:
    1. Calculate their sum
    2. Find all twin primes (pairs of primes that differ by 2)
    3. Calculate the average gap between consecutive primes
    4. Identify the largest prime in the range

    Validate that all the primes you found are correct and that the sum is accurate.
    """
    result1 = agent.run(query1)
    print(result1)
    print("\n" + "=" * 80 + "\n")
    print("📊 Example 2: Advanced Sales Data Analysis with Statistical Validation")
    print("-" * 60)
    query2 = """
    Create a comprehensive sales data analysis:
    1. Generate sales data for 12 products across 24 months with realistic seasonal patterns
    2. Calculate monthly and yearly growth rates and perform trend analysis
    3. Identify the top 3 best-performing and bottom 3 worst-performing products
    4. Perform correlation analysis between different products
    5. Compute summary statistics: mean, median, standard deviation, and percentiles

    After the analysis, validate the data structure and verify that the statistics are correct.
    """
    result2 = agent.run(query2)
    print(result2)
    print("\n" + "=" * 80 + "\n")
    print("⚙️ Example 3: Advanced Algorithm Implementation with Test Suite")
    print("-" * 60)
    query3 = """
    Implement and verify a comprehensive sorting and searching system:
    1. Implement binary search, quicksort, and mergesort algorithms
    2. Create test data covering edge cases (empty lists, single elements, duplicates, sorted/reverse-sorted)
    3. Benchmark the performance of the sorting algorithms
    4. Find the kth largest element using different approaches
    5. Test all implementations with comprehensive test suites, including edge cases

    After implementation, validate each algorithm with multiple test cases to verify correctness.
    """
    result3 = agent.run(query3)
    print(result3)
    print("\n" + "=" * 80 + "\n")
    print("🤖 Example 4: Machine Learning Model with Cross-Validation")
    print("-" * 60)
    query4 = """
    Build a complete machine learning pipeline:
    1. Generate a synthetic dataset with features and a target variable (classification problem)
    2. Implement data preprocessing (normalization, feature scaling)
    3. Implement a simple linear classifier trained with gradient descent
    4. Split the data into train/validation/test sets
    5. Train the model and evaluate its performance (accuracy, precision, recall)
    6. Perform k-fold cross-validation
    7. Compare results across different hyperparameters

    Validate the entire pipeline: verify that the gradient descent is mathematically correct, the data splits are accurate, and the performance metrics are realistic.
    """
    result4 = agent.run(query4)
    print(result4)
    print("\n" + "=" * 80 + "\n")
    print("📋 Execution Summary")
    print("-" * 60)
    summary = agent.get_execution_summary()
    print(f"Total code executions: {summary['total_executions']}")
    print(f"Successful executions: {summary['successful_executions']}")
    print(f"Failed executions: {summary['failed_executions']}")

    if summary['failed_executions'] > 0:
        print("\nFailed executions details:")
        for i, execution in enumerate(summary['execution_details']):
            if execution['error']:
                print(f"  {i+1}. Error: {execution['error']}")

    print(f"\nSuccess rate: {(summary['successful_executions']/summary['total_executions']*100):.1f}%")
Finally, we instantiate the AdvancedClaudeCodeAgent with your Anthropic API key and run four illustrative queries covering prime-number analysis, sales data analytics, algorithm implementation, and a simple ML pipeline, printing each result. After the runs, the script gathers an execution summary — total runs, successes, failures, and any error details — demonstrating the complete "write → run → validate" workflow in action.
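For Example 1, the code the agent ends up generating typically looks something like this (a hand-written sketch of the expected computation, not the agent's actual output):

```python
def primes_up_to(n):
    """Simple trial-division prime generator."""
    return [p for p in range(2, n + 1)
            if all(p % d for d in range(2, int(p ** 0.5) + 1))]

primes = primes_up_to(200)
prime_set = set(primes)
twin_primes = [(p, p + 2) for p in primes if p + 2 in prime_set]
gaps = [b - a for a, b in zip(primes, primes[1:])]

print(f"Count: {len(primes)}, Sum: {sum(primes)}")
print(f"Twin primes: {len(twin_primes)} pairs, Largest prime: {max(primes)}")
print(f"Average gap: {sum(gaps) / len(gaps):.2f}")
```

The validator can then check properties such as the prime count, the maximum value, or a range for the sum against the numbers this snippet prints.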
In conclusion, we have developed a versatile AdvancedClaudeCodeAgent capable of seamlessly blending generative reasoning with precise computational control. It doesn't just write Python code; it runs the code on the fly, verifies its correctness against your defined criteria, and closes the feedback loop automatically. The same pattern applies equally well to prime-number calculations, algorithm benchmarking, or full ML workflows.
Check out the Notebook on GitHub. All credit for this research goes to the researchers of this project.
Asif Razzaq, CEO of Marktechpost Media Inc. is a visionary engineer and entrepreneur who is dedicated to harnessing Artificial Intelligence’s potential for the social good. Marktechpost is his latest venture, a media platform that focuses on Artificial Intelligence. It is known for providing in-depth news coverage about machine learning, deep learning, and other topics. The content is technically accurate and easy to understand by an audience of all backgrounds. Over 2 million views per month are a testament to the platform’s popularity.


