Need help with Yelp Dataset

August 13, 2025

—

Hi,

So I’m working on an assignment using the Yelp Open Dataset. The task is to analyze hospitality review data (hotels, restaurants, spas) not for ratings, but for signs of unfair treatment, bias, or systemic behavior that could impact access, experience, or rep

The problem is that I’m stuck before even starting any EDA or text analysis. The dataset’s categories field in business.json is super messy: 1,300+ unique labels, many of them long combined strings mixing several venue types (e.g., “American (Traditional), Bars, Nightlife, Pub, Bistro”, etc.). I’ve tried category matching and fuzzy string matching, but my filters for hospitality keywords keep returning only a few matches or none at all, and the assignment only specifies “hotels, restaurants, spas” without further guidance. The prof said that’s all that can be said to help.

Is there a way to substring match, or some other reliable way, to pull all hospitality businesses (hotels, restaurants, spas) from the dataset?
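
For concreteness, here is a minimal sketch of the kind of filter in question. It assumes business.json holds one JSON object per line and that categories is a single comma-separated string that may be null; the pattern and the function names are hypothetical, not something given in the assignment. It matches whole words rather than raw substrings, since a plain substring check for "spa" would also hit "Spanish".

```python
import json
import re

# Hypothetical pattern: whole-word, case-insensitive matches for the three
# venue types named in the assignment (singular or plural).
HOSPITALITY_RE = re.compile(r"\b(?:hotels?|restaurants?|spas?)\b", re.IGNORECASE)


def is_hospitality(categories):
    """Return True if the raw categories string names a hotel, restaurant, or spa."""
    if not categories:  # assumption: the field can be null or empty
        return False
    return bool(HOSPITALITY_RE.search(categories))


def filter_hospitality(lines):
    """Yield parsed business records whose categories match the pattern.

    `lines` is any iterable of JSON strings (for example an open file handle),
    assuming one JSON object per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        business = json.loads(line)
        if is_hospitality(business.get("categories")):
            yield business


if __name__ == "__main__":
    # Made-up records for illustration only, not real dataset rows.
    sample = [
        '{"name": "Example Diner", "categories": "American (Traditional), Restaurants"}',
        '{"name": "Example Tapas", "categories": "Spanish, Tapas Bars"}',
        '{"name": "Example Retreat", "categories": "Day Spas, Beauty & Spas"}',
    ]
    for business in filter_hospitality(sample):
        print(business["name"], "|", business["categories"])
```

To run it on the real file, something like `with open("business.json", encoding="utf-8") as f: matches = list(filter_hospitality(f))` should work under the same assumptions. The keyword list is deliberately narrow and would likely need extending (for example with labels such as "Bed & Breakfast" or "Resorts") after looking at which labels actually appear in the data.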

Cheers

submitted by /u/Proof-Combination334 to r/learnmachinelearning
