Yandex researchers have just released YaMBDa: a large-scale dataset for recommender systems with 4.79 billion user interactions from Yandex Music. This includes data from its personalized real-time music feed, My Wave.
The set contains plays, likes/dislikes, timestamps, and track features — all anonymized using numeric IDs. While the source is music-related, YaMBDa is designed for general-purpose RecSys tasks beyond streaming.
This is a pretty big deal, since progress in RecSys has been bottlenecked by limited access to high-quality, realistic datasets. Even with LLMs and fast training cycles, there’s still a shortage of public data that approximates production-scale workloads.
Popular datasets like LFM-1B, LFM-2B, and MLHD-27B have become unavailable due to licensing issues. Criteo’s 4B ad dataset used to be the largest of its kind, but YaMBDa has apparently surpassed it with nearly 5 billion interaction events.
🔍 What’s in the dataset:
- 3 dataset sizes: 50M, 500M, and the full 4.79B events
- Audio-based track embeddings (via CNN)
- Metadata (track duration, artist, album, etc.)
- is_organic flag to separate organic vs. recommended actions
- Parquet format, compatible with Pandas, Polars, and Spark
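
To give a feel for the Parquet files and the is_organic flag listed above, here is a minimal Pandas sketch, not the official loader. Only `is_organic` is taken from the post; the other column names (`uid`, `item_id`, `timestamp`), the sample rows, and the file path in the comment are assumptions for illustration, so check the dataset card for the real schema.

```python
import pandas as pd


def split_by_organic(events: pd.DataFrame):
    """Split interaction events into (organic, recommended) frames.

    Assumes `is_organic` is a 0/1 or boolean column, as the post suggests.
    """
    mask = events["is_organic"].astype(bool)
    return events[mask], events[~mask]


# With a real file you would load it with something like:
#   events = pd.read_parquet("likes.parquet")  # hypothetical path
# Here a tiny synthetic frame stands in for the real data.
events = pd.DataFrame(
    {
        "uid": [1, 1, 2, 2, 3],                 # assumed column name
        "item_id": [10, 11, 10, 12, 13],        # assumed column name
        "timestamp": [100, 105, 110, 120, 130], # assumed column name
        "is_organic": [1, 0, 1, 0, 0],          # flag named in the post
    }
)

organic, recommended = split_by_organic(events)
organic_share = len(organic) / len(events)
print(f"organic: {len(organic)}, recommended: {len(recommended)}, share: {organic_share:.2f}")
```

Keeping the two groups separate lets you evaluate a model on organic actions only, which avoids rewarding it for simply reproducing what the existing recommender already surfaced.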
🔗 The dataset is hosted on HuggingFace, the benchmark code is on GitHub, and the research paper is available on arXiv.
Let me know if anyone’s already experimenting with it — would love to hear how it performs across different RecSys approaches!
submitted by /u/azalio to r/learnmachinelearning