NVIDIA's VIBETENSOR is an open-source software stack for deep learning that was generated by LLM-powered coding agents with high-level guidance from humans.
The system poses a direct question: can coding agents generate a coherent deep learning runtime, spanning Python and JavaScript frontends down to C++ memory management and runtime components, and validate it using tools alone?
Architecture: frontends through to CUDA runtime
VIBETENSOR is a PyTorch-style eager tensor runtime with a C++20 core for CPU and CUDA, a nanobind overlay that mimics the torch interface, and an experimental Node.js/TypeScript frontend. It targets Linux x86_64 with NVIDIA GPUs via CUDA; builds without CUDA are intentionally disabled.
The core stack implements its own tensor and storage system, a schema-lite dispatcher, reverse-mode autograd, a CUDA subsystem with streams, events, and CUDA graphs, a stream-ordered caching allocator with diagnostics, and a stable C ABI for dynamically loaded operator plugins. Both the Python and Node.js frontends share the same C++ dispatcher and tensor implementation.
The Python overlay exposes a vibetensor.torch namespace with tensor factories, operator dispatch, and CUDA utilities. The Node.js frontend is built on Node-API and focuses on async execution; worker scheduling bounds the amount of concurrent work in flight.
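The bounded-concurrency idea behind the Node.js worker scheduling can be sketched with a semaphore. This is an illustrative Python/asyncio sketch of the general pattern, not VIBETENSOR's actual Node-API scheduler:

```python
import asyncio

async def bounded_gather(coros, limit):
    """Run coroutines with at most `limit` in flight at once (illustrative sketch)."""
    sem = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with sem:          # admission control: blocks when `limit` jobs are running
            return await coro

    return await asyncio.gather(*(guarded(c) for c in coros))

# Track peak concurrency to show the bound is actually enforced.
state = {"now": 0, "peak": 0}

async def job(i):
    state["now"] += 1
    state["peak"] = max(state["peak"], state["now"])
    await asyncio.sleep(0.01)    # stand-in for real async work
    state["now"] -= 1
    return i

results = asyncio.run(bounded_gather([job(i) for i in range(6)], limit=2))
```

Six jobs are submitted, but at most two ever run concurrently; results still come back in submission order because `gather` preserves ordering.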
At the core of the runtime, TensorImpl is a view over reference-counted Storage with a shared version counter, plus sizes and device metadata; non-contiguous views and aliasing are supported. TensorIterator computes iteration shapes for elementwise and reduction operators, and the same logic is exposed to external kernels through ABI plugins, which follow the same aliasing and iteration rules.
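The storage/view relationship can be sketched in a few lines. This is a hypothetical Python sketch of the concept (the real TensorImpl/Storage are C++, and these names and fields are illustrative only):

```python
class Storage:
    """Shared buffer standing in for reference-counted Storage (sketch, not the real API)."""
    def __init__(self, data):
        self.data = data        # flat underlying buffer
        self.version = [0]      # shared version counter, aliased by every view

class TensorImpl:
    """A view over Storage: sizes, strides, and offset, sharing the version counter."""
    def __init__(self, storage, sizes, strides, offset=0):
        self.storage, self.sizes, self.strides, self.offset = storage, sizes, strides, offset

    def bump_version(self):
        # In-place ops bump the shared counter so stale saved tensors can be detected.
        self.storage.version[0] += 1

    def item(self, *idx):
        # Strided indexing is what makes non-contiguous views and aliasing work.
        flat = self.offset + sum(i * s for i, s in zip(idx, self.strides))
        return self.storage.data[flat]

buf = Storage(list(range(6)))
base = TensorImpl(buf, sizes=(2, 3), strides=(3, 1))        # contiguous 2x3
transposed = TensorImpl(buf, sizes=(3, 2), strides=(1, 3))  # same storage, swapped strides
```

Both views read the same buffer through different strides, and a version bump through either view is visible to the other, which is the invariant autograd relies on.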
The dispatcher is schema-lite: it maps operator names across CPU and CUDA dispatch keys to their implementations and provides wrapper layers for autograd and Python overrides. Device policies enforce invariants such as “all tensor inputs on the same device,” while leaving room for multi-device policies.
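A schema-lite dispatch table of this kind can be sketched as a mapping from (operator name, device key) to an implementation, with the same-device invariant checked before dispatch. Everything below is a hypothetical illustration, not VIBETENSOR's real registration API:

```python
# Minimal schema-lite dispatcher sketch: ops are registered by (name, device key)
# with no full schema parsing; tensors are plain dicts for illustration.
_REGISTRY = {}

def register(name, device):
    def deco(fn):
        _REGISTRY[(name, device)] = fn
        return fn
    return deco

def dispatch(name, *tensors):
    # Device policy: all tensor inputs must live on the same device.
    devices = {t["device"] for t in tensors}
    if len(devices) != 1:
        raise RuntimeError(f"{name}: expected all inputs on one device, got {devices}")
    key = (name, devices.pop())
    if key not in _REGISTRY:
        raise NotImplementedError(f"no kernel registered for {key}")
    return _REGISTRY[key](*tensors)

@register("add", "cpu")
def add_cpu(a, b):
    return {"device": "cpu", "data": [x + y for x, y in zip(a["data"], b["data"])]}

x = {"device": "cpu", "data": [1.0, 2.0]}
y = {"device": "cpu", "data": [3.0, 4.0]}
```

Calling `dispatch("add", x, y)` routes to `add_cpu`; mixing a CPU tensor with a CUDA tensor raises before any kernel runs, which is exactly what the device policy buys you.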
Autograd, CUDA, and Multi-GPU Fabric
Reverse-mode autograd is built from Node/Edge graph objects and per-tensor AutogradMeta. During backward, the engine maintains per-input dependency counts and gradient buffers, along with a queue of ready tasks. To synchronize gradients across streams, it records and waits on CUDA events for CUDA tensors, and it includes an experimental mode for cross-device execution.
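The dependency-count plus ready-queue scheme can be sketched in plain Python. This is a conceptual sketch with scalar gradients and hypothetical names; the real engine is C++ and also handles streams and events:

```python
from collections import deque

class Node:
    """Backward-graph node (sketch): applies a grad fn and feeds gradients to its inputs."""
    def __init__(self, name, backward_fn, inputs=()):
        self.name, self.backward_fn, self.inputs = name, backward_fn, inputs

def run_backward(root, root_grad=1.0):
    # Count consumers per node: a node becomes "ready" only after gradients
    # from all of its consumers have been accumulated into its buffer.
    deps, seen, stack = {}, set(), [root]
    while stack:
        n = stack.pop()
        if id(n) in seen:
            continue
        seen.add(id(n))
        for inp in n.inputs:
            deps[id(inp)] = deps.get(id(inp), 0) + 1
            stack.append(inp)
    buffers, grads = {id(root): root_grad}, {}
    ready = deque([root])
    while ready:
        n = ready.popleft()
        g = buffers.pop(id(n))
        grads[n.name] = g
        for inp, gin in zip(n.inputs, n.backward_fn(g)):
            buffers[id(inp)] = buffers.get(id(inp), 0.0) + gin  # gradient buffer
            deps[id(inp)] -= 1
            if deps[id(inp)] == 0:
                ready.append(inp)
    return grads

# Diamond graph: d = b + c, b = 2a, c = 3a, so d(d)/d(a) = 2 + 3 = 5.
a = Node("a", lambda g: ())
b = Node("b", lambda g: (2.0 * g,), (a,))
c = Node("c", lambda g: (3.0 * g,), (a,))
d = Node("d", lambda g: (g, g), (b, c))
```

The diamond shows why dependency counts matter: `a` must wait until both `b` and `c` have contributed before its accumulated gradient is final.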

The CUDA subsystem provides C++ wrappers for CUDA streams and events, a caching allocator with stream-ordered semantics, and CUDA graph capture and replay. Allocator diagnostics include snapshots, statistics, memory-fraction limits, and GC ladders, so memory behavior can be observed during testing and debugging. CUDA graphs are integrated with the allocator via “graph pools” that manage memory across capture, recording, and replay.
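The stream-ordered caching idea is worth making concrete: freed blocks go back to a per-stream cache instead of the driver, so a block freed on stream S is only handed straight back to allocations on S. This is an illustrative Python sketch of the policy, not the real C++ allocator:

```python
from collections import defaultdict

class CachingAllocator:
    """Sketch of a stream-ordered caching allocator with snapshot-style diagnostics."""
    def __init__(self):
        self.free_blocks = defaultdict(list)  # (size, stream) -> cached block ids
        self.next_id = 0
        self.stats = {"alloc": 0, "cache_hit": 0, "freed": 0}

    def malloc(self, size, stream):
        key = (size, stream)
        if self.free_blocks[key]:
            self.stats["cache_hit"] += 1      # reuse is safe: same stream order
            return self.free_blocks[key].pop()
        self.stats["alloc"] += 1              # otherwise a fresh "device" allocation
        self.next_id += 1
        return self.next_id

    def free(self, block, size, stream):
        # Stream-ordered free: cache the block for reuse on the same stream
        # rather than returning it to the driver immediately.
        self.stats["freed"] += 1
        self.free_blocks[(size, stream)].append(block)

    def snapshot(self):
        # Diagnostics in the spirit of allocator snapshots/statistics.
        return {k: len(v) for k, v in self.free_blocks.items() if v}
```

Reallocating the same size on the same stream hits the cache and returns the same block; the same request on a different stream gets a fresh block, because nothing guarantees the first stream is done with the cached one.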
Fabric is an experimental multi-GPU layer. When the topology permits, it exposes explicit peer-to-peer GPU access via CUDA P2P. Fabric is primarily a single-process, multi-GPU design that provides event- and statistics-based observability primitives.
As a reference extension, VIBETENSOR ships a CUTLASS-based ring-allreduce plugin for NVIDIA's Blackwell GPUs, demonstrating a ring allreduce that does not depend on NCCL. In the paper, multi-GPU results rely on Fabric with this optional plugin and are reported only for Blackwell GPUs.
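The communication pattern of a ring allreduce is independent of the CUTLASS implementation and can be simulated with plain lists standing in for per-GPU buffers: N-1 reduce-scatter steps, then N-1 all-gather steps, each rank passing one chunk to its ring neighbor. A sketch of the pattern only, not of the plugin's code:

```python
def ring_allreduce(buffers):
    """Simulated ring allreduce: buffers[r] is rank r's list of n chunk values."""
    n = len(buffers)
    data = [list(b) for b in buffers]
    # Reduce-scatter: at step s, rank r forwards chunk (r - s) % n to rank r+1,
    # which adds it into its own copy. Values are snapshotted per step to model
    # all ranks sending "simultaneously".
    for step in range(n - 1):
        sends = [((r + 1) % n, (r - step) % n, data[r][(r - step) % n]) for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] += val
    # After reduce-scatter, rank r owns the fully reduced chunk (r + 1) % n.
    # All-gather: circulate the completed chunks around the ring.
    for step in range(n - 1):
        sends = [((r + 1) % n, (r + 1 - step) % n, data[r][(r + 1 - step) % n]) for r in range(n)]
        for dst, c, val in sends:
            data[dst][c] = val
    return data
```

With three ranks holding `[1,1,1]`, `[2,2,2]`, and `[3,3,3]`, every rank ends with `[6,6,6]`: each rank sends and receives only one chunk per step, which is what keeps ring bandwidth per link constant as the ring grows.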
Extension points and interoperability
VIBETENSOR supports DLPack import and export for CPU tensors and includes a C++20 safetensors loader and saver for serialization. Python-level overrides are inspired by torch.library. The plugin ABI exposes DLPack-based dtype and device metadata and includes hooks for custom GPU kernels written in Triton or with CUDA template libraries such as CUTLASS. TensorIterator helpers let external kernels integrate with the same iteration and aliasing rules as built-in operators.
AI-assisted development
VIBETENSOR was developed with LLM-powered coding agents as the primary authors of the code, guided only by high-level human specifications. Over roughly two months, humans set targets and constraints while agents proposed code diffs, then ran builds and tests to verify them. The work does not introduce a new agent framework; it treats agents as tools that modify a codebase under tool-based checks. Validation rests on C++ tests via CTest, Python tests via pytest, and differential checks against reference implementations such as PyTorch. The team also uses longer training regressions, allocator diagnostics, and CUDA diagnostics to catch stateful bugs and performance pathologies that unit tests miss.
What you need to know
- AI-generated, CUDA-first deep learning stack: VIBETENSOR is an Apache 2.0 open-source PyTorch-style eager runtime generated by LLM coding agents; it targets Linux x86_64 with NVIDIA GPUs and requires CUDA.
- Full runtime architecture, not just kernels: The system includes a C++20 tensor core (TensorImpl/Storage/TensorIterator), a schema-lite dispatcher, reverse-mode autograd, a CUDA subsystem with streams, events, and graphs, a stream-ordered caching allocator, and a versioned C plugin ABI, exposed through a Python vibetensor.torch overlay and an experimental Node.js frontend.
- Tool-driven, agent-centric development workflow: Over roughly two months, humans defined high-level objectives while agents proposed diffs, validated via CTest, pytest, and differential checks against PyTorch.
- Faster microkernels, but slower end-to-end training: AI-generated kernels in Triton/CuTeDSL achieve up to ~5–6× speedups over PyTorch baselines in isolated benchmarks, but full training workloads (Transformer toy tasks, CIFAR-10 ViT, miniGPT-style LM) run 1.7× to 6.2× slower than PyTorch, underscoring the gap between kernel-level and system-level performance.
Check out the Paper and the Repo for more details.


