
Meet Talkie-1930, a 13B LLM Open-Weight Trained in Pre-1931 English Texts for Historical Reasoning and Generalization Research

Tech · By Gavin Wallace · 28/04/2026 · 7 Mins Read

Imagine a language model that has never heard of the Internet, smartphones, or World War II. That's not a hypothetical: it's exactly what a team of researchers led by Nick Levine, David Duvenaud, and Alec Radford has built. It is called Talkie, and it is possibly the most disciplined, historically accurate large language model ever released.

Talkie is an open-weight, 13-billion-parameter language model trained exclusively on pre-1931 English text. The nonprofit team behind the project calls it a "vintage language model": an LM with a hard knowledge cutoff tied not to when it was trained, but to a specific moment in history.

What is a vintage language model?

Understanding talkie starts with its core concept. Most modern LLMs (GPT-4, LLaMA, Mistral, and others) are trained on crawls of the current internet, so their worldview reflects the world as it was at training time. A vintage language model is the opposite: it trains deliberately on historical data only, so its worldview is pinned to an earlier point in time.

Talkie's cutoff is December 31, 1930, chosen precisely because that is the date when works enter the public domain in the United States, making pre-1931 text legally usable for training.

The model, formally named talkie-1930-13b-base, was trained on 260 billion tokens of pre-1931 English documents, including newspapers, books, journals, scientific publications, patents, and case law. A separately post-trained conversational checkpoint, talkie-1930-13b-it, is also available. The team has set up a 24/7 live demo at talkie-lm.com/chat where Claude Sonnet 4.6 continuously prompts the instruction-tuned model, allowing visitors to observe talkie's voice and knowledge in real time.

Why Choose a Model from 1930?

This isn't a nostalgia project. Talkie is interesting to the AI community for several concrete, technically relevant use cases identified by the research team.

1. Contamination-free tests of generalization: Benchmark contamination, where test data inadvertently leaks into training data, is one of the most persistent and underappreciated problems in modern LLM evaluation. Because talkie was trained only on text written before 1931, it is contamination-free by construction with respect to every modern benchmark. This clean setup allows a more accurate evaluation of how well LMs can generalize from their pre-training data. For example, the team tested whether talkie could learn Python, a language that did not exist in 1930, from a few in-context demonstration examples. On the HumanEval benchmark, the vintage models underperformed web-trained counterparts but were "slowly but steadily improving at this task with scale."
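The in-context probe described above amounts to a prompt-construction step: show the model a few (signature, body) demonstrations and ask it to complete a fresh signature, HumanEval-style. A minimal sketch, where the demo pairs and the target task are illustrative stand-ins rather than the team's actual evaluation data:

```python
# Build a few-shot prompt that teaches Python by example, then leaves a new
# function signature for the model to complete. In the real experiment the
# prompt would be fed to talkie-1930-13b-base and the completion scored
# against HumanEval-style unit tests.

def build_few_shot_prompt(demos, task_signature):
    """Concatenate (signature, body) demos, ending with the unsolved task."""
    parts = [signature + "\n" + body for signature, body in demos]
    parts.append(task_signature)  # the model must complete this one
    return "\n\n".join(parts)

demos = [
    ("def add(a, b):", "    return a + b"),
    ("def square(x):", "    return x * x"),
]
prompt = build_few_shot_prompt(demos, "def double(x):")
```

The prompt ends exactly at the unsolved signature, so everything the model emits next can be treated as its candidate solution.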

2. Forecasting evaluation and temporal surprise: Inspired by Calcifer Computing's work on Temporal Language Models, the research team used talkie to measure the surprisingness of historical events described in the New York Times's "On This Day" feature. Events after 1930, talkie's knowledge cutoff, are consistently more surprising to the model, with the effect most pronounced for 1950s and 1960s events, followed by a plateau. The model can thus be used to study how forecasting ability scales with model size and how performance declines over longer time horizons.
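The "surprise" score here is just average per-token surprisal: the mean negative log-probability the model assigns to the tokens of an event description. A minimal sketch, with hand-picked stand-in probabilities where the real setup would read per-token probabilities from talkie's forward pass:

```python
import math

def mean_surprisal_bits(token_probs):
    """Average -log2 p(token) over a sequence; higher = more surprising."""
    return sum(-math.log2(p) for p in token_probs) / len(token_probs)

# A pre-1931 event should receive relatively high token probabilities
# (low surprise); a post-cutoff event should receive low ones.
pre_1931_score  = mean_surprisal_bits([0.5, 0.4, 0.6])    # e.g. 1929 crash
post_1931_score = mean_surprisal_bits([0.05, 0.02, 0.1])  # e.g. moon landing
```

Plotting this score against each event's year is what produces the "surprise rises after the cutoff, then plateaus" curve the article describes.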

3. LLM identity formation and persona: Because talkie was trained on a fundamentally different distribution than any modern model, it opens up questions about what shapes an LLM’s “identity.” Modern LLMs — regardless of their provider — all share a common ancestor in web data, whether through direct training or through distillation and synthetic data pipelines. Talkie completely breaks this lineage, providing researchers with a way to determine what behavior and abilities are common to all language models versus those that are an artifact of the training available on today’s web.

Why the Training Pipeline Is Hard to Build

Creating a vintage language model isn't as easy as filtering modern data by date. The talkie research team faced several engineering challenges.

Temporal leakage is the most critical threat. If any post-1930 text slips into the training corpus, through misdated documents or old texts with anachronistic editorial introductions, the model's historical fidelity is compromised. The earlier 7B version of talkie was clearly aware of the New Deal and Roosevelt's presidency, revealing gaps in the filtering. The team built a document-level, n-gram-based anachronism classifier to filter the corpus, but acknowledges it is still imperfect: the 13B version retains some awareness of World War II and the postwar order.
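A document-level, n-gram-based anachronism filter of the kind described can be sketched very simply: flag any document containing known post-1930 phrases. The blocklist below is illustrative; the team's actual classifier and term list are not public.

```python
# Toy anachronism filter: a document is rejected if it contains any n-gram
# from a blocklist of post-1930 terms. A production version would use a far
# larger list and likely a trained classifier on top of the n-gram hits.

POST_1930_NGRAMS = {
    "new deal",
    "world war ii",
    "nuclear weapon",
    "television network",
    "computer program",
}

def is_anachronistic(document: str, threshold: int = 1) -> bool:
    """Flag a document with at least `threshold` post-1930 n-gram hits."""
    text = document.lower()
    hits = sum(1 for ngram in POST_1930_NGRAMS if ngram in text)
    return hits >= threshold

clean_doc  = is_anachronistic("The stock market crashed in October 1929.")
leaked_doc = is_anachronistic("Roosevelt's New Deal reshaped the economy.")
```

The failure mode the article mentions (the 13B model still knowing about World War II) is exactly what such a list-based filter misses: leakage phrased in terms the blocklist does not contain.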

Data quality is another major barrier. Because there was no digital publishing in 1930, every token in talkie's training corpus had to be transcribed from physical sources via optical character recognition (OCR). In controlled experiments, the team found that training on raw OCR transcriptions achieved only 30% of the learning efficiency of training on human transcriptions of the same texts. A simple regex cleanup pass improved this to 70%, but a large gap remained. The team is now building a dedicated vintage OCR system fine-tuned to historical document formats to close it.
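The article does not say which regexes the team used, but a cleanup pass of this simple kind typically re-joins hyphenated line breaks, collapses whitespace, and expands ligatures. The rules below are assumptions for illustration, not the team's actual pipeline:

```python
import re

def clean_ocr(text: str) -> str:
    """Illustrative regex cleanup for OCR'd historical text."""
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)       # re-join hyphenated line breaks
    text = re.sub(r"[ \t]+", " ", text)                # collapse runs of spaces/tabs
    text = re.sub(r"\n{3,}", "\n\n", text)             # collapse blank-line runs
    text = text.replace("ﬁ", "fi").replace("ﬂ", "fl")  # expand common ligatures
    return text.strip()

sample  = "The  tele-\nphone  is a  ﬁne\n\n\n\ninvention."
cleaned = clean_ocr(sample)
```

Even fixes this shallow plausibly help a lot, since hyphenation and ligature noise fragments words the tokenizer would otherwise learn as single units.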

Vintage post-training: the instruction-tuning phase required building an entirely new pipeline from scratch, since using modern instruction-response pairs would inject contemporary expectations into the model's behavior. The team instead generated instruction-response pairs from historical sources such as etiquette books, letter-writing manuals, cookbooks, dictionaries, and encyclopedias. They then ran online direct preference optimization (DPO) with Claude Sonnet 4.6 as a judge, improving talkie's average instruction-following rating from 2.0 to 3.4 on a five-point scale. A final fine-tuning stage used rejection sampling over synthetic conversations generated by Claude Opus 4.6.
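For readers unfamiliar with DPO, the objective is a single log-sigmoid over a margin: given the policy's and a frozen reference model's log-probabilities for the judge-preferred ("chosen") and rejected responses, the loss pushes the policy to widen its chosen-vs-rejected preference relative to the reference. A back-of-envelope sketch with made-up numbers:

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * preference margin)."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy prefers the chosen answer more strongly than the reference does:
good = dpo_loss(policy_chosen=-10.0, policy_rejected=-14.0,
                ref_chosen=-11.0, ref_rejected=-13.0)
# Policy prefers the rejected answer: the loss is higher.
bad  = dpo_loss(policy_chosen=-14.0, policy_rejected=-10.0,
                ref_chosen=-11.0, ref_rejected=-13.0)
```

In the "online" variant the article describes, the chosen/rejected labels come from a judge (here Claude Sonnet 4.6) scoring fresh samples during training, rather than from a fixed offline preference dataset.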

Benchmarks: How Does a 1930 Model Compare?

To give the results context, the research team trained a "modern twin": an architecturally identical 13B model trained on modern web data (FineWeb), and compared it against talkie. As expected, talkie performs worse than its modern counterpart on standard LM evaluations. But when question anachronism is controlled for, by filtering out questions that reference concepts that would not have existed in 1930, the performance gap roughly halves, and the researchers note encouraging parity on core numeracy and language tasks. They attribute the remaining gap to OCR noise and differences in subject distribution.
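"Controlling for question anachronism" can be sketched as a pre-filter over the benchmark: drop any question mentioning a post-cutoff concept before scoring either model. The term list and questions below are illustrative assumptions, not the team's actual filter:

```python
# Toy anachronism filter for evaluation questions: keep only questions whose
# vocabulary a 1930 model could plausibly know.

POST_CUTOFF_TERMS = {"internet", "dna", "laser", "television", "python"}

def is_fair_for_1930(question: str) -> bool:
    """True if the question mentions no known post-1930 concept."""
    words = set(question.lower().replace("?", "").split())
    return words.isdisjoint(POST_CUTOFF_TERMS)

questions = [
    "What is the capital of France?",
    "Who invented the Internet?",
]
fair_subset = [q for q in questions if is_fair_for_1930(q)]
```

Comparing talkie and its modern twin only on the fair subset is what halves the reported gap: the filtered-out questions are unanswerable for talkie by construction, not because of weaker general ability.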

The Key Takeaways

  • Talkie is a 13B open-weight "vintage language model" trained on 260 billion tokens of exclusively pre-1931 English text, making it the largest known vintage LM, with a hard knowledge cutoff of December 31, 1930.
  • The design eliminates benchmark contamination. Because talkie has never seen modern data, it serves as a uniquely clean testbed for generalization experiments, including whether a model with no knowledge of digital computers can learn to write Python code from in-context examples alone.
  • Building a vintage LM is much harder than filtering by date. Researchers had to contend with temporal data leakage (post-1930 information slipping into the pipeline), OCR noise that cut training efficiency to 30% of human-transcription quality, and an entire post-training pipeline built from pre-1931 sources such as encyclopedias and etiquette books.
  • Two checkpoints are publicly available under Apache 2.0: talkie-1930-13b-base for raw completions and talkie-1930-13b-it for conversation. Running them locally requires a CUDA GPU with at least 28 GB of VRAM.
  • Bigger models are coming. The research team is targeting a GPT-3-level vintage model by summer 2026, with a corpus they estimate can scale to over a trillion tokens — potentially enough to match the capability of the original ChatGPT, frozen in 1930.

Check out the Model Weights, Repo, and Technical Details for more.


The post "Meet Talkie-1930: A 13B Open-Weight LLM Trained on Pre-1931 English Text for Historical Reasoning and Generalization Research" first appeared on MarkTechPost.
