Yandex has released Yambda, the world’s largest publicly available dataset for recommender system research and development, an important contribution to the recommender systems community. The dataset is designed to bridge the gap between academic research and industry-scale applications, offering nearly 5 billion anonymized user interaction events from Yandex Music, one of the company’s flagship streaming services with over 28 million monthly users.
Yambda: Addressing a Critical Gap in Recommender Systems
Today, recommender systems are at the heart of personalized digital experiences across a wide range of services – from social media and ecommerce to streaming platforms. They rely on huge volumes of behavior data (such as likes, clicks, or listens) to deduce user preferences and provide tailored content.
However, recommender systems have lagged behind other areas of AI, including natural language processing. This is largely due to the lack of large, openly available datasets. Unlike large language models (LLMs), which learn from publicly available text sources, recommender systems need sensitive behavioral data, which is commercially valuable and hard to anonymize. Companies have historically guarded this data closely and limited researchers’ access to it.
Existing datasets, such as Spotify’s Million Playlist Dataset or the Netflix Prize data, are either too small, lack temporal detail, or are too poorly documented to support the development of production-grade recommendation models. Yandex’s release of Yambda aims to solve these problems with a large, comprehensive, feature-rich dataset.
Yambda’s Scale, Richness, and Privacy
The Yambda dataset consists of 4.79 billion anonymized user interactions collected over a period of 10 months. These events come from roughly one million Yandex Music listeners interacting with nearly 9.4 million tracks. The dataset contains:
- User Interactions: Feedback is both implicit (listens) and explicit (likes, dislikes, and the removal of either).
- Anonymized Audio Embeddings: Vector representations of tracks derived from convolutional neural networks, which let models exploit audio content similarity.
- Organic Interaction Flags: An “is_organic” flag shows whether a user discovered a track on their own or through recommendations, which supports behavioral analysis.
- Precise Timestamps: Every event is timestamped, preserving the temporal order needed to model sequential behavior.
To meet privacy standards and ensure that no personally identifiable data is exposed, all user and track identifiers are anonymized as numeric identifiers.
The dataset is provided in Apache Parquet format, which is suited to big data frameworks such as Apache Spark and Hadoop and is compatible with analytical libraries such as Pandas and Polars. This makes Yambda usable by researchers and developers in a variety of environments.
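As an illustration of the kind of behavioral analysis the “is_organic” flag supports, here is a minimal sketch in plain Python. The field names, the event type labels, and the timestamp unit are assumptions made for this example and may not match the actual Yambda schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    user_id: int      # anonymized numeric user identifier
    item_id: int      # anonymized numeric track identifier
    timestamp: int    # event time; the unit is an assumption of this sketch
    event_type: str   # hypothetical labels: "listen", "like", "dislike"
    is_organic: bool  # True if the user found the track without a recommendation


def organic_share(events):
    """Return, per event type, the fraction of events flagged as organic."""
    totals = defaultdict(int)
    organic = defaultdict(int)
    for e in events:
        totals[e.event_type] += 1
        organic[e.event_type] += int(e.is_organic)
    return {t: organic[t] / totals[t] for t in totals}


# Toy data, invented for illustration only.
events = [
    Event(1, 10, 100, "listen", True),
    Event(1, 11, 160, "listen", False),
    Event(2, 10, 170, "listen", False),
    Event(2, 12, 220, "listen", True),
    Event(2, 12, 230, "like", True),
]
shares = organic_share(events)
print(shares)
```

On the real dataset the same aggregation would more likely be expressed as a group-by in Pandas, Polars, or Spark over the Parquet files.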
Global Temporal Split Evaluation Method
A key innovation of Yandex’s dataset is its Global Temporal Split (GTS) evaluation strategy. In typical recommender system research, the widely used Leave-One-Out method holds out each user’s last interaction for testing. This disrupts the temporal continuity of user interactions and creates unrealistic training conditions.
GTS splits the data by timestamp while preserving the full sequence of events. This mirrors real-world conditions more closely, because no future data leaks into training and models are tested on truly unseen interactions.
A temporally aware benchmark of this kind is crucial for evaluating how algorithms perform under realistic constraints.
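The difference between the two splitting strategies can be sketched in a few lines of plain Python. This is an illustrative sketch rather than Yandex’s implementation: events are assumed to be (user_id, item_id, timestamp) tuples, and the helper names and the split time are hypothetical.

```python
def leave_one_out_split(events):
    """Hold out each user's latest event for testing; train on the rest."""
    last = {}
    for e in events:
        user, _, ts = e
        if user not in last or ts > last[user][2]:
            last[user] = e
    held_out = set(last.values())
    train = [e for e in events if e not in held_out]
    test = [e for e in events if e in held_out]
    return train, test


def global_temporal_split(events, split_time):
    """Train on everything before split_time; test on everything after it."""
    train = [e for e in events if e[2] < split_time]
    test = [e for e in events if e[2] >= split_time]
    return train, test


def leaks_future(train, test):
    """True if some training event is newer than some test event."""
    return max(e[2] for e in train) > min(e[2] for e in test)


# Toy (user_id, item_id, timestamp) events, invented for illustration.
events = [
    (1, 10, 100), (1, 11, 500), (1, 12, 900),
    (2, 10, 150), (2, 13, 300),
]

loo_train, loo_test = leave_one_out_split(events)
gts_train, gts_test = global_temporal_split(events, split_time=400)

print(leaks_future(loo_train, loo_test))  # user 1's event at t=500 is newer than user 2's test event at t=300
print(leaks_future(gts_train, gts_test))
```

With Leave-One-Out, the model trains on an event that happened after one of the test events, which is the leakage a global split avoids by construction.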
Baseline Models & Metrics Included
Yandex provides baseline recommender models implemented on the dataset to support benchmarking:
- MostPop: A popularity-based model that recommends the most popular items.
- DecayPop: A popularity model in which an item’s popularity decays over time.
- ItemKNN: A neighborhood-based collaborative filtering method.
- iALS: Matrix factorization using implicit Alternating Least Squares.
- BPR: Bayesian Personalized Ranking, a pairwise ranking method.
- SANSA and SASRec: A sparse autoencoder-based model (SANSA) and a sequence-aware model that uses self-attention (SASRec).
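To make the two simplest baselines concrete, here is a minimal sketch of MostPop and DecayPop in plain Python. The exponential half-life form of the decay is an assumption of this sketch, not necessarily the formula used in the Yambda baselines, and the function names and parameters are hypothetical.

```python
from collections import defaultdict


def top_k(scores, k):
    """Return the k highest-scoring items, breaking ties by item id."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]


def most_pop(train, k):
    """Score each item by its raw interaction count."""
    scores = defaultdict(float)
    for _, item, _ in train:
        scores[item] += 1.0
    return top_k(scores, k)


def decay_pop(train, k, now, half_life):
    """Score each item by interactions weighted with an assumed exponential decay."""
    scores = defaultdict(float)
    for _, item, ts in train:
        # An interaction that is `half_life` time units old counts half as much.
        scores[item] += 0.5 ** ((now - ts) / half_life)
    return top_k(scores, k)


# Toy (user_id, item_id, timestamp) events, invented for illustration.
train = [
    (1, 10, 0), (2, 10, 0), (3, 10, 0),  # item 10: popular, but long ago
    (1, 11, 90), (2, 11, 100),           # item 11: recent
    (3, 12, 100),                        # item 12: recent
]

print(most_pop(train, k=2))
print(decay_pop(train, k=2, now=100, half_life=10))
```

Both baselines return the same list for every user; DecayPop simply favors items whose interactions are recent, so an old hit can drop out of the top positions.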
These baselines are evaluated with standard recommender metrics such as:
- NDCG@k (Normalized Discounted Cumulative Gain): Measures ranking quality, giving more weight to relevant items that appear near the top of the list.
- Recall@k: Measures the fraction of relevant items that appear in the top k recommendations.
- Coverage@k: Indicates how much of the catalogue the recommendations span.
These benchmarks help researchers compare the performance of new methods with established ones.
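The three metrics can be sketched as follows. This is a minimal sketch that assumes binary relevance and common textbook definitions; the exact normalization in the Yambda benchmark code may differ, and the function names are hypothetical.

```python
import math


def recall_at_k(recommended, relevant, k):
    """Share of a user's relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits near the top of the list count more."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, item in enumerate(recommended[:k], start=1)
        if item in relevant
    )
    # Ideal ordering: all relevant items (up to k) placed first.
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


def coverage_at_k(recommendations, catalog_size, k):
    """Share of the catalogue that appears in at least one user's top k."""
    shown = {item for recs in recommendations.values() for item in recs[:k]}
    return len(shown) / catalog_size


# Toy data, invented for illustration.
recommended = [1, 2, 3]
relevant = {1, 3, 5}
print(recall_at_k(recommended, relevant, k=3))
print(ndcg_at_k(recommended, relevant, k=3))
print(coverage_at_k({"u1": [1, 2, 3], "u2": [1, 2, 4]}, catalog_size=10, k=2))
```

Recall@k and NDCG@k are computed per user and then averaged over users, while Coverage@k is computed once over all users’ recommendation lists.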
Applicability beyond Music Streaming
The dataset is derived from a music streaming service, but its use extends beyond this domain. Yambda’s interaction types, dynamic user behaviour, and massive scale make it a universal benchmark across industries such as video platforms, ecommerce, and social networking. Algorithms developed on it can be adapted or generalized to different recommendation tasks.
Benefits for Different Stakeholders
- Academia: Enables rigorous testing of theories and new algorithms at an industry-relevant scale.
- Startups and SMBs: Provides a resource comparable to what the biggest tech companies have, helping to level the playing field.
- End Users: Benefit indirectly from smarter recommendation algorithms, which improve content discovery, reduce search time, and increase engagement.
Yandex’s Personalized Recommender System: My Wave
Yandex Music relies on a proprietary recommender system called My Wave. The system uses AI and deep neural networks to personalize music suggestions. My Wave analyses thousands of factors, including:
- Listening history and user interaction sequences.
- Customizable user preferences, such as mood or language.
- Real-time music analysis, including spectrograms, vocal tone, and frequency ranges.
This system adapts dynamically to each listener by identifying audio similarities and predicting preferences. It exemplifies the kind of complex recommendation pipeline that can benefit from large datasets such as Yambda.
Ensuring Privacy and Ethical Use
The release of Yambda highlights the importance of privacy in recommender systems research. Yandex anonymizes the data using numeric IDs and does not include any personally identifiable information. The dataset contains only interaction signals, without disclosing users’ exact identities or sensitive attributes.
This balance between privacy and openness allows for robust research while protecting user data.
Access and Versions
Yandex offers three different size options for the Yambda dataset to meet the needs of researchers and users with different computing capacities.
- Full version: ~5 billion events.
- Medium version: ~500 million events.
- Small version: ~50 million events.
All versions are available on Hugging Face, a platform that hosts datasets and models for machine learning, which makes them easy to integrate into research workflows.
Conclusion
Yandex’s release of the Yambda dataset is a significant milestone in recommender system research. By providing an unprecedented volume of anonymized interaction data together with temporally aware evaluation and baselines, it sets a new standard for benchmarking and innovation. Researchers, startups, and enterprises can now explore and build recommender systems that better reflect real-world usage.
Datasets such as Yambda are essential to pushing the limits of AI personalization.
Check out the Yambda Dataset on Hugging Face.
Note: Thanks to the Yandex team for the thought leadership and resources for this article. The Yandex team has sponsored and supported this article.
Asif Razzaq is the CEO of Marktechpost Media Inc. An engineer and entrepreneur, he is dedicated to harnessing the potential of Artificial Intelligence for social good. His latest venture is Marktechpost, a media platform focused on Artificial Intelligence and known for in-depth coverage of machine learning and deep learning news that is technically accurate yet easy for a broad audience to understand. The platform draws over 2 million views per month.

