YaMBDa: Yandex open-sources massive RecSys dataset with nearly 5B user interactions.

Yandex researchers have just released YaMBDa: a large-scale dataset for recommender systems with 4.79 billion user interactions from Yandex Music. The data includes interactions from My Wave, the service's personalized real-time music feed.

The set contains plays, likes/dislikes, timestamps, and track features — all anonymized using numeric IDs. While the source is music-related, YaMBDa is designed for general-purpose RecSys tasks beyond streaming.

This is a pretty big deal since progress in RecSys has been bottlenecked by limited access to high-quality, realistic datasets. Even with LLMs and fast training cycles, there’s still a shortage of data that approximates real-world production loads.

Popular datasets like LFM-1B, LFM-2B, and MLHD-27B have become unavailable due to licensing issues. Criteo’s 4B ad dataset used to be the largest of its kind, but YaMBDa has apparently surpassed it with nearly 5 billion interaction events.

🔍 What’s in the dataset:

- 3 dataset sizes: 50M, 500M, and the full 4.79B events
- Audio-based track embeddings (via CNN)
- Metadata (track duration, artist, album, etc.)
- is_organic flag to separate organic vs. recommended actions
- Parquet format, compatible with Pandas, Polars, and Spark
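
To make the is_organic flag a bit more concrete, here is a minimal sketch of how events could be separated by origin. It uses a handful of made-up rows in plain Python rather than the real Parquet files. The field names (uid, item_id, timestamp) and the encoding 1 = organic, 0 = recommended are assumptions for illustration, so check the dataset card before relying on them.

```python
from collections import Counter

# Hypothetical rows standing in for interaction events. The field names and
# the 1 = organic / 0 = recommended encoding are assumptions, not the
# confirmed YaMBDa schema.
events = [
    {"uid": 1, "item_id": 10, "timestamp": 100, "is_organic": 1},
    {"uid": 1, "item_id": 11, "timestamp": 105, "is_organic": 0},
    {"uid": 2, "item_id": 10, "timestamp": 200, "is_organic": 0},
    {"uid": 2, "item_id": 12, "timestamp": 230, "is_organic": 1},
    {"uid": 3, "item_id": 11, "timestamp": 300, "is_organic": 0},
]


def split_by_origin(rows):
    """Return (organic, recommended) lists based on the is_organic flag."""
    organic = [r for r in rows if r["is_organic"] == 1]
    recommended = [r for r in rows if r["is_organic"] == 0]
    return organic, recommended


def organic_share(rows):
    """Fraction of events flagged as organic; 0.0 for an empty input."""
    if not rows:
        return 0.0
    counts = Counter(r["is_organic"] for r in rows)
    return counts[1] / len(rows)


organic, recommended = split_by_origin(events)
print(len(organic), len(recommended), organic_share(events))
```

On a Pandas DataFrame `df` loaded from one of the Parquet files, the same filter would look like `df[df["is_organic"] == 1]`, again assuming that column name and encoding. A split like this is one possible way to check how much a model's offline score depends on actions that an existing recommender already steered.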

🔗 The dataset is hosted on HuggingFace, the benchmark code is on GitHub, and the research paper is available on arXiv.

Let me know if anyone’s already experimenting with it — would love to hear how it performs across different RecSys approaches!

submitted by /u/azalio to r/learnmachinelearning