This is my first time posting here, and I'm not sure this is the right forum for my question. If it isn't, please excuse me.
I'm trying to use Naive Bayes on my data. Click here to download the dataset.
This is my code so far:
import pandas as pd
import scipy.sparse
from nltk.corpus import stopwords
from sklearn import naive_bayes
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

data = pd.read_json('/Users/rokayadarai/Desktop/Coding/DataSets/Hotel_Reviews.json')
data.head()
# stop words are not useful (a, and, the)
stopset = set(stopwords.words('english'))
vectorizer = TfidfVectorizer(use_idf=True, lowercase=True, strip_accents='ascii', stop_words=stopset)
y = data['Reviewer_Score']
# TF-IDF features of both review columns, stacked side by side (result is sparse)
X = scipy.sparse.hstack([
    vectorizer.fit_transform(data['Negative_Review']),
    vectorizer.fit_transform(data['Positive_Review']),
])
# 515738 observations and 106514 unique words
print(y.shape)
print(X.shape)
# split the data: test_size=0.2 holds out 20% for testing; random_state=123 makes the split reproducible
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=123)
# train the naive Bayes classifier (this is the call that raises the error below)
clf = naive_bayes.GaussianNB()
clf.fit(X_train, y_train)
# test the model's accuracy
roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
And this is the error I get:
TypeError: A sparse matrix was passed, but dense data is required. Use X.toarray() to convert to a dense NumPy array.
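In case it helps, the same error can be reproduced on a much smaller example. This is only a minimal sketch: the four-review corpus and the binary labels are made up for illustration (they are not from my dataset), and `fits_gaussian_nb` is a hypothetical helper that just reports whether `GaussianNB` accepts the input.

```python
import scipy.sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import GaussianNB

# Toy stand-ins for the two review columns and the target (made up for illustration).
negative = ["room was dirty", "staff was rude", "nothing to complain", "no complaints"]
positive = ["nothing good", "only the location", "lovely clean room", "friendly staff"]
labels = [0, 0, 1, 1]

# TF-IDF each column and stack the two sparse matrices side by side.
X_sparse = scipy.sparse.hstack([
    TfidfVectorizer().fit_transform(negative),
    TfidfVectorizer().fit_transform(positive),
])


def fits_gaussian_nb(X):
    """Return True if GaussianNB accepts X, False if it raises TypeError."""
    try:
        GaussianNB().fit(X, labels)
        return True
    except TypeError:
        return False


sparse_ok = fits_gaussian_nb(X_sparse)            # sparse input, as in my code
dense_ok = fits_gaussian_nb(X_sparse.toarray())   # dense copy, as the message suggests
print(sparse_ok, dense_ok)
```

On the toy data the dense copy is accepted, but going by the shape in my code comment (515738 rows by 106514 columns), a dense array would hold about 5.5 × 10^10 values, which at 8 bytes each is roughly 440 GB, so I don't think `X.toarray()` is an option for the full dataset.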
Could somebody please help me out? I'm stuck. I know I'm doing something wrong, but I can't figure out what, and I can't find anything on the Internet that helps.