The AI industry operates under a core assumption: flexibility is key. Because AI models change constantly, we build general-purpose GPUs and programmable silicon that can adapt to the latest research.
Taalas, a Toronto-based startup, argues the opposite: flexibility is precisely what holds AI back. According to the Taalas team, if we want AI to be as common and cheap as plastic, we have to stop ‘simulating’ intelligence on general-purpose computers and start ‘casting’ it directly into silicon.
The Problem: The ‘Memory Wall’ and the GPU Tax
The cost of operating a Large Language Model today is driven by a physical bottleneck: the ‘Memory Wall’.
Traditional processors such as GPUs are built around an ‘Instruction Set Architecture’ (ISA) that separates memory from compute. To run inference on a model like Llama 3, the processor spends an enormous amount of time and energy shuttling data in from High Bandwidth Memory. This ‘data movement tax’ accounts for nearly 90% of the power consumption in modern AI data centers.
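The memory wall is easy to see with a back-of-envelope roofline calculation. The figures below (fp16 weights, an H100's approximate HBM3 bandwidth) are illustrative assumptions of mine, not numbers from Taalas; the result lands in the same ballpark as the single-user GPU throughput cited later in this article.

```python
# Back-of-envelope: why single-user LLM decoding is memory-bound.
# Assumed figures: an 8B-parameter model in fp16 occupies ~16 GB,
# and H100 HBM3 bandwidth is roughly 3.35 TB/s.

params = 8e9             # Llama 3 8B parameter count
bytes_per_param = 2      # fp16
hbm_bandwidth = 3.35e12  # bytes/s (approximate H100 HBM3 figure)

weight_bytes = params * bytes_per_param  # ~16 GB of weights

# Generating one token requires streaming every weight through
# compute, so bandwidth, not FLOPs, caps single-user throughput:
max_tokens_per_s = hbm_bandwidth / weight_bytes

print(f"bandwidth-bound ceiling: ~{max_tokens_per_s:.0f} tokens/s")
```

The ceiling works out to roughly 200 tokens per second for one user, no matter how much raw compute the GPU has, which is why eliminating the weight-fetch traffic changes the picture so dramatically.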
Taalas’s radical solution is to remove the memory-retrieval cycle entirely. Using a proprietary automated design flow, the company translates the computational graph of a specific model directly into the physical layout of a chip. In the HC1 (Hardcore 1) chip, the model’s weights and architecture are literally imprinted into the wiring.
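A loose software analogy for ‘casting’ a model into silicon is constant-folding: instead of a generic routine that fetches weights from memory on every call, generate a specialized function with the weights baked in as literals. This toy sketch is my illustration of the idea, not Taalas’s actual design flow:

```python
# Generic inference: weights are data, fetched on every call.
def matvec(weights, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# "Hardwired" inference: a code generator specializes the function,
# baking the weights in as constants -- loosely analogous to
# imprinting a model's parameters into a chip's wiring.
def harden(weights):
    rows = []
    for row in weights:
        terms = " + ".join(f"{w!r} * x[{i}]" for i, w in enumerate(row))
        rows.append(terms)
    src = "def model(x):\n    return [" + ", ".join(rows) + "]"
    ns = {}
    exec(src, ns)  # compile the specialized model function
    return ns["model"]

W = [[1.0, 2.0], [3.0, -1.0]]
model = harden(W)
print(model([1.0, 1.0]))      # -> [3.0, 2.0]
print(matvec(W, [1.0, 1.0]))  # same result, but W is fetched as data
```

In the hardened version the weights never exist as separate data to be fetched; the model *is* the function, just as on the HC1 the model is the circuit.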
Hardcore Models: 17,000 Tokens Per Second
The results of this ‘direct-to-silicon’ approach redefine the performance ceiling for inference. In a recent demonstration, Taalas’s HC1 ran a Llama 3 8B model at 17,000 tokens per second. For comparison, a top-tier NVIDIA H100 GPU serves a single user at roughly 150 tokens per second; the HC1 can serve a staggering 30,000 users at 16 to 17 tokens per second each.
This changes the ‘unit economics’ of AI:
- Performance: One HC1 chip can outperform a small GPU data center in terms of raw throughput.
- Efficiency: Taalas claims a 1000x improvement in efficiency (performance-per-watt and performance-per-dollar) compared to conventional chips.
- Infrastructure: Because the weights are hardwired, there is no need for external HBM stacks or complicated liquid cooling. The 250W cards sit in a standard air-cooled server rack, delivering the throughput of an entire GPU cluster in a single box.
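The unit economics above can be sketched as tokens per joule, using the article’s own throughput and power figures plus one outside assumption of mine (a ~700 W board power for an H100 SXM):

```python
# Rough tokens-per-joule comparison using the figures quoted above.
hc1_tokens_per_s = 17_000
hc1_watts = 250           # per-card figure quoted in the article

h100_tokens_per_s = 150   # single-user decode, per the article
h100_watts = 700          # H100 SXM TDP (my assumption)

hc1_eff = hc1_tokens_per_s / hc1_watts    # tokens per joule
h100_eff = h100_tokens_per_s / h100_watts

print(f"HC1 : {hc1_eff:.1f} tokens/J")
print(f"H100: {h100_eff:.2f} tokens/J")
print(f"ratio: ~{hc1_eff / h100_eff:.0f}x")
```

This single-user comparison yields a gap of a few hundred times on energy alone; it flatters the HC1 somewhat, since production GPU deployments batch many users to raise tokens per joule, while the full 1,000x claim also folds in cost-per-dollar.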
The Automated Foundry: From Weights to Silicon in 60 Days
The obvious ‘catch’ for an AI developer is flexibility. What happens if a new model is released tomorrow and you have hard-wired a particular model onto a chip? Historically, designing an ASIC (Application-Specific Integrated Circuit) took two years and tens of millions of dollars.
Taalas’s answer is automation. Its system works like a compiler for the foundry, generating a chip design in a matter of days. By streamlining the manufacturing workflow so that only the top metal masks of the silicon change between models, the company has collapsed the weights-to-silicon turnaround to just two months.
This allows for a ‘seasonal’ hardware cycle. In the spring, a company can fine-tune its frontier model and deploy thousands of hyper-efficient, specialized inference chips by summer.

The Market Shift: From Selling Shovels to Printing Stamps
The AI hype cycle is at a turning point. We are moving from the ‘Research & Training’ phase—where GPUs are essential for their flexibility—to the ‘Deployment & Inference’ phase, where cost-per-token is the only metric that matters.
If Taalas succeeds, the AI market could split into two tiers:
- General-Purpose Training: NVIDIA, AMD and others provide the flexible, massive clusters required to train and discover new architectures.
- Specialized Inference: Led by ‘foundries’ like Taalas, which take those proven architectures and ‘print’ them into cheap, ubiquitous silicon for everything from smartphones to industrial sensors.
The Key Takeaways
- The ‘Hardwired’ Paradigm Shift: Taalas moves from software-defined AI (running models on general-purpose GPUs) to hardware-defined AI. By ‘baking’ a specific model’s weights and architecture directly into the silicon, it eliminates traditional instruction-set overhead, effectively making the model the processor itself.
- The Death of the Memory Wall: Taalas’s HC1 (Hardcore 1) chip eliminates the ‘Memory Wall’, the data movement that accounts for nearly 90% of power consumption in AI data centers. By physically wiring the model parameters into the chip’s metal layers, it removes the need for expensive High Bandwidth Memory.
- 1,000x Improvement in Efficiency: By stripping away the ‘programmability tax’, Taalas claims a 1,000x improvement in performance-per-watt and performance-per-dollar. Drawing a maximum of 450 watts, the HC1 delivers 17,000 tokens per second on a Llama 3.1 8B model—massively outperforming a standard GPU rack while using far less power.
- Automated ‘Direct-to-Silicon’ Foundry: To solve the problem of model obsolescence, Taalas has developed a proprietary automated design flow that turns a model into a customized AI chip in a couple of months rather than years, allowing companies to ‘print’ their fine-tuned models into silicon on a seasonal basis.
- Commodity Artificial Intelligence Future: This technology signals a shift from ‘Cloud-First’ to ‘Device-Native’ AI. As inference becomes a cheap, hardwired commodity, AI will move off centralized servers and into local, low-power hardware—ranging from smartphones to industrial sensors—with zero latency and no subscription costs.


