__author__ = "Amoli Rajgor"
__email__ = "amoli.rajgor@gmail.com"
__website__ = "amolir.github.io"
There is a dedicated notebook for the data preparation stage containing the implementation of all the intermediate steps. At the end of each stage the processed data is stored as a CSV file. The current notebook focuses on stage 2 and stage 3.
I will be using the following list of packages for the project.
ℹ️ Dependencies
➤ numpy ≥ 1.22.3
➤ pandas ≥ 1.4.1
➤ scipy ≥ 1.8.0
➤ gensim ≥ 4.1.2
➤ spacy ≥ 3.3.0
➤ spacy-model-en_core_web_md ≥ 3.3.0
➤ nltk ≥ 3.5
➤ scikit-learn ≥ 1.0.2
➤ gsdmm ≥ 0.1
➤ pyldavis ≥ 3.3.1
➤ networkx
➤ corextopic ≥ 1.1
➤ pickle
➤ html
➤ re
➤ string
➤ collections
Run pip install -r requirements.txt from the terminal to install all the dependencies before running the notebook. Create a folder named data and place the downloaded .json file inside it; the intermediate CSV files (restaurant_review.csv and aspect.csv) will be stored in the data folder itself. The cleaned reviews dataframe is stored as a pickle object cleaned.pickle. Similarly, stopwords learned during the training are stored as a pickle object stopwords.pickle. Custom stopword and topic-word lists are defined in stopwordslist.py. Run the notebooks in the order data_preparation.ipynb -> aspect_extraction.ipynb to generate results.
# Data Manipulation
import pandas as pd
import numpy as np
# html parsing
import html
# RegEx and String Manipulation
import re
import string
# For Sentence Tokenization
from spacy.lang.en import English
# For Text Cleaning
import spacy
# Storing Objects
import pickle
# For finding frequent words
from collections import Counter
from nltk import FreqDist
# For Bigrams
from gensim.models import Phrases
from gensim.models.phrases import Phraser
# For LDA
from gensim import corpora
from gensim import models
# For GSDMM
from gsdmm import MovieGroupProcess
from gensim.models.coherencemodel import CoherenceModel
# Visualization
import matplotlib.pyplot as plt
# Visualize LDA
import pyLDAvis
import pyLDAvis.gensim_models
# For generating sparse data matrix
import scipy.sparse as ss
# For CorEx topic modelling
import corextopic.corextopic as ct
import corextopic.vis_topic as vt
# To generate bow
from sklearn.feature_extraction.text import CountVectorizer
import warnings
warnings.filterwarnings('ignore')
# warnings.filterwarnings("ignore", category=DeprecationWarning)
# warnings.filterwarnings("ignore", category=FutureWarning)
# Import custom stopwords and topic words list
from stopwordslist import *
reviews = pd.read_csv("data/restaurant_review.csv")
print(f'\n\033[1mShape of the data: \033[0m {reviews.shape}', end="\n\n")
display(reviews.head())
From a quick inspection of the review texts, we see the presence of HTML entities, incorrectly placed terminating punctuation, misplaced blank spaces, etc. So before extracting features from the data, let's first clean it and tokenize the texts into sentences. The steps involved in the text preparation stage are as follows (a minimal illustration follows the list):
1. Parse HTML entities in the text (e.g. convert &quot; to ").
2. Add a trailing space after terminating punctuation (!, ?, and .) so that sentences split correctly.
3. Tokenize each review into individual sentences.
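As a quick illustration, here is a minimal sketch of the first two steps on a made-up snippet; the simplified regex here is not the exact pattern used below, which also protects decimals and abbreviations.
import html
import re
raw = "Great &quot;tasting menu&quot; &amp; friendly staff.Service was a bit slow!Still worth it."
decoded = html.unescape(raw)                        # &quot; -> " and &amp; -> &
spaced = re.sub(r'([.!?])(?=\S)', r'\1 ', decoded)  # simplified: add a space after ., ! and ?
print(spaced)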
# html entity in the text
print("\n\033[1m{}\033[0m".format("Text before parsing the HTML entities:"))
print(reviews.review_text[11],end ="\n")
reviews["text"] = reviews["review_text"].apply(html.unescape)
# Entity removed
print("\n\033[1m{}\033[0m".format("Text after parsing the HTML entities:"))
print(reviews.text[11])
# Incorrect punctuation, sentence will not split
print("\n\033[1m{}\033[0m".format("Text with incorrect terminating punctuation in the first line:"))
print(reviews.text[52][-216:], end ="\n")
# Add space after punctuations
punct_trail_space_pattern = re.compile(r'(\d+\.\d+|\b[A-Z](?:\.[A-Z])*\b\.?)|([.,;:!?])\s*')
def add_trail_space(text):
return re.sub(punct_trail_space_pattern, lambda x: x.group(1) or f'{x.group(2)} ', text)
# Add a trailing space after terminating punctuation in every review
reviews["text"] = reviews["text"].apply(add_trail_space)
print("\n\033[1m{}\033[0m".format("Text with corrected punctuation:"))
print(reviews.text[52][-218:])
# Text with multiple sentences
print("\n\033[1m{}\033[0m".format("Review containing multiple sentences:"))
print(reviews.review_text[1], end="\n")
# Empty pipeline with only language and no model
nlp = English()
nlp.add_pipe("sentencizer")
def sentence_tokenizer(doc):
return [sent.text for sent in doc.sents]
def tokenize_pipe(texts):
preproc_pipe = []
for doc in nlp.pipe(texts, batch_size=20):
preproc_pipe.append(sentence_tokenizer(doc))
return preproc_pipe
reviews['text'] = tokenize_pipe(reviews['text'])
print("\n\033[1m{}\033[0m".format("Review split into separate sentences:"))
display(reviews.text[1])
# Flatten the list so that each sentence gets its own row
reviews = reviews.explode("text", ignore_index=True)
reviews.head()
reviews.shape
Next, split sentences that contain multiple clauses separated by a semicolon (;) using a simple regex.
# Clause separation
print("\n\033[1m{}\033[0m".format("Review with clauses:"))
print(reviews.text[410])
# print(reviews.text[2], end="\n\n")
# Split sentences having multiple clauses separated by ";"
def split_sentence_clauses(text):
return re.split(r"(?<!\w\;\w.)(?<![A-Z][a-z]\;)(?<=\;)\s", text)
# return re.split(r"(?<!\w\;\w.)(?<![A-Z][a-z]\;)(?<=\;)\s|(?<!\w\,\w.)(?<![A-Z][a-z]\,)(?<=\,)\s", text)
reviews["text"] = reviews["text"].apply(split_sentence_clauses)
print("\n\033[1m{}\033[0m".format("Clauses split as documents:"))
print(reviews.text[410])
# print(reviews.text[2])
# Flatten the list so that each clause gets its own row
reviews = reviews.explode("text", ignore_index=True)
# Clauses are split
display(reviews.iloc[413:415])
# display(reviews.iloc[np.r_[2:5, 569:571]])
reviews.shape
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
nlp.add_pipe('sentencizer')
stopwords = set(nlp.Defaults.stop_words)
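# Note: at this stage we keep lowercased surface forms filtered by POS tag, length and
# stopword membership; lemmatization proper is applied later on the cleaned tokens.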
def tokenize_lda_pipe(doc, stopwords, pos_list):
lemma_list = [token.text.lower() for token in doc if token.is_alpha and
token.pos_ in pos_list and len(token)>2 and
token.text.lower() not in stopwords]
return " ".join(lemma_list)
def preprocess_lda_pipe(texts, stopwords, pos_list = ["NOUN", "ADJ", "VERB"]):
processed_pipe = []
for doc in nlp.pipe(texts, batch_size=1000):
processed_pipe.append(tokenize_lda_pipe(doc, stopwords, pos_list=pos_list))
return processed_pipe
%time reviews['lda_tokens'] = preprocess_lda_pipe(reviews['text'], stopwords, ["NOUN", "ADJ", "VERB", "ADV"])
%time reviews['lda_tokens_noun'] = preprocess_lda_pipe(reviews['text'], stopwords, ["NOUN"])
# Remove empty lists
reviews = reviews[reviews['lda_tokens'].map(lambda d: len(d)) > 0].copy()
reviews = reviews.reset_index().rename(columns={"index":'review_id', "id":"restaurant_id"})
reviews.head(8)
# Find all the tokens
review_tokens = reviews.lda_tokens.str.split(" ").tolist()
# Train the model to learn ngrams
bigram_model = Phrases(review_tokens, min_count=30, threshold=5)
bigram_phraser = Phraser(bigram_model)
trigram_model = Phrases(bigram_model[review_tokens], min_count=15, threshold=5)
trigram_phraser = Phraser(trigram_model)
def get_ngram(doc):
l = [token.text for token in doc]
l = trigram_phraser[bigram_phraser[l]]
return " ".join(l)
def ngram_lda_pipe(texts):
ngram_pipe = []
for doc in nlp.pipe(texts, batch_size=1000):
ngram_pipe.append(get_ngram(doc))
return ngram_pipe
%time reviews['lda_tokens'] = ngram_lda_pipe(reviews['lda_tokens'])
reviews.head()
# with open('cleaned.pickle', 'wb') as handle:
# pickle.dump(reviews, handle, protocol=pickle.HIGHEST_PROTOCOL)
Find the most frequently occurring words in the corpus and use this information to create a relevant set of stopwords. We will not remove all of the most frequent words, because many of them are keywords for determining aspects such as Food, Service, Staff, Ambience, etc. Later, the 1000 most prominent words are also used to form seed words for semi-supervised LDA.
# Frequency distribution of all the words
def get_most_frequent_words(all_words_list, word_limit = 1000, remove_stopwords = False):
corpus_list = [word for word_list in all_words_list for word in word_list]
if remove_stopwords:
stopwords = set(nlp.Defaults.stop_words)
corpus_list = [word for word in corpus_list if word.lower() not in stopwords]
f_dist = FreqDist(corpus_list)
if word_limit:
# Word limit set: return word_limit words in decreasing order of occurrence
top_words, _ = zip(*f_dist.most_common(word_limit))
else:
# Word limit not set: return all words in decreasing order of occurrence
top_words, _ = zip(*f_dist.most_common())
return top_words
# Get a list of most frequently used words
# print(get_most_frequent_words(reviews.loc[:, 'lda_tokens'].str.split(), 1000, True)[0:100])
frequent_words_list = get_most_frequent_words(reviews.loc[:, 'lda_tokens'].str.split(), None, False)
print(f'\n\033[1m Most Frequent Words: \033[0m {frequent_words_list[:100]}')
print(f'\n\033[1m Total number of words in the vocabulary:\033[0m {len(frequent_words_list)}')
Choose a threshold to remove the least-occurring words. Here we find words whose occurrences make up less than 0.001% of the total number of words in the corpus.
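As a rough sanity check on what this threshold means (the corpus size here is hypothetical; the actual count is computed from corpus_list below):
# Hypothetical: in a corpus of 400,000 tokens, 0.001% corresponds to 4 occurrences,
# so words appearing fewer than ~4 times would be added to the stopword list.
hypothetical_corpus_size = 400_000
min_occurrences = hypothetical_corpus_size * 0.001 / 100
print(min_occurrences)  # 4.0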
# Threshold: percentage occurrence of a word in the corpus
threshold = 0.001
all_word_list = reviews.lda_tokens.str.split(" ").to_list()
corpus_list = [word for word_list in all_word_list for word in word_list]
topic_occurence = Counter(corpus_list).most_common()
topic_occurence = [(topic, freq, round(float(freq) * 100 /len(corpus_list),4)) for topic, freq in topic_occurence]
less_than_point_001_percent_lemma = [topic for topic, freq, percent in topic_occurence if percent < threshold]
# less_than_5_freq_lemma = [topic for topic, freq, percent in topic_occurence if freq < 5]
with open('stopwords.pickle', 'wb') as handle:
pickle.dump(less_than_point_001_percent_lemma, handle, protocol=pickle.HIGHEST_PROTOCOL)
# pickle.dump(less_than_5_freq_lemma, handle, protocol=pickle.HIGHEST_PROTOCOL)
# topic_occurence
print("\n\033[1m{}\033[0m".format("List of words occurring less than 0.001% times:"))
print(less_than_point_001_percent_lemma[1:10])
print("\n\033[1m{}\033[0m".format("Top words occurring less than 0.001% times:"))
topic_occurence[-10::1]
It can be observed that the majority of words in the least-occurring list are misspelt. Spell checking is a processing-heavy task, so we are not explicitly correcting these words; doing so would improve the lemmatization results, but assuming there are not many misspelt words we simply remove them from the vocabulary. A spelling-correction function like the one shown below (demo) could be included as part of the data-cleaning pipeline.
from nltk.metrics.distance import edit_distance
from nltk.corpus import words
correctly_spelled_words = words.words()
misspelled = ['acomadating', 'attachd', 'distence']
for word in misspelled:
temp = [(edit_distance(word, w),w) for w in correctly_spelled_words if w[0]==word[0]]
print(sorted(temp, key = lambda val:val[0])[0][1])
One part of the training process involves updating the list of stopwords with words that are found to be irrelevant after running the model. These words appear among the top words of a topic but don't add much to its semantics. For example, during lemmatization we want to keep VERB tokens that add meaning to the reviews, such as eat, serve, dine, recommend, complain, clean, etc.; but doing so also retains tokens like catch, allow, save, suppose, speak, carry, move, etc., which are ambiguous and can represent multiple topics. Because of this we have to constantly refine the vocabulary to keep only tokens that are useful in creating distinct topics. We therefore start from the cleaned reviews data for every fresh run of the model and update the lemma_stopwords list based on the results.
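For instance, after inspecting a run, the list in stopwordslist.py could be extended along the lines below. The exact contents of that file are not shown in this notebook, so this is only an illustrative sketch (assuming lemma_stopwords is a set) using the ambiguous verbs mentioned above.
# Illustrative only: grow lemma_stopwords with topic-neutral verbs spotted in the latest run
lemma_stopwords = lemma_stopwords | {"catch", "allow", "save", "suppose", "speak", "carry", "move"}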
with open('cleaned.pickle', 'rb') as handle:
reviews = pickle.load(handle)
reviews.head()
with open('stopwords.pickle', 'rb') as handle:
less_than_point_001_percent_lemma = pickle.load(handle)
# less_than_5_freq_lemma = pickle.load(handle)
print(less_than_point_001_percent_lemma[0:10])
# print(less_than_5_freq_lemma[0:10])
nlp = spacy.load('en_core_web_md', disable=['parser', 'ner'])
nlp.add_pipe('sentencizer')
stopwords = set(nlp.Defaults.stop_words)
custom_stopwords = custom_stopwords.union(lemma_stopwords).union(stopwords).union(set(less_than_point_001_percent_lemma))
print(f'\n\033[1mTotal number of stopwords to be removed: \033[0m {len(custom_stopwords)}')
# Refine Lemmatization
def lemmatization(doc, stopwords):
lemma_list = [str(token.lemma_) for token in doc if
len(token)>2 and
token.lemma_ not in stopwords and token.text not in stopwords]
return lemma_list
def lemmatize_lda_pipe(texts, stopwords):
processed_pipe = []
for doc in nlp.pipe(texts, batch_size=1000):
processed_pipe.append(lemmatization(doc, stopwords))
return processed_pipe
# Lemmatize
%time reviews['topic_words'] = lemmatize_lda_pipe(reviews["lda_tokens"], custom_stopwords)
# Lemmatize
%time reviews['lda_tokens_noun'] = lemmatize_lda_pipe(reviews["lda_tokens_noun"], stopwords)
reviews.shape
reviews.head(3)
LDA is an unsupervised algorithm used to extract topics from documents. In topic modelling we have two basic scenarios:
1. A document can contain multiple topics (e.g. a single review can talk about food and drinks and service).
2. A word can belong to multiple topics (e.g. warm applies to both food and ambience).
Dirichlet distributions are continuous, multivariate probability distributions that can model the relations described above. Applying LDA to the documents generates two sets of information: topics per document and words per topic. The number of topics is a hyperparameter that has to be tuned during training and is a prerequisite for running the model.
import logging
logging.basicConfig(filename='gensim.log',
format='%(asctime)s : %(levelname)s : %(message)s',
level=logging.DEBUG)
lda_params = {"num_topics" : 35,
"alpha" : "auto",
"eta" : "auto",
"passes" : 5,
"iterations": 50,
"per_word_topics": True}
def get_vocabulary(tokens):
dictionary_lda = corpora.Dictionary(tokens)
corpus = [dictionary_lda.doc2bow(tok) for tok in tokens]
return {#"tokens" : tokens,
"dictionary_lda" : dictionary_lda,
"corpus" : corpus}
def get_lda_model(corpus, dictionary_lda, num_topics = 13, passes = 5, iterations = 50, alpha="auto", eta="auto", per_word_topics=True):
np.random.seed(49)
lda_model = models.LdaModel(corpus, num_topics=num_topics, \
id2word=dictionary_lda, \
random_state=49, update_every=1, \
iterations=iterations, passes=passes,\
alpha=alpha, eta=eta, \
per_word_topics=per_word_topics) #alpha=[0.01]*num_topics [0.01]*len(dictionary_LDA.keys())
return lda_model
def get_model_performance(parameter, cases, texts, lda_params, coherence="c_v"):
performance = {parameter:[], "coherence":[]}
for c in cases:
lda_params.update({parameter : c})
performance[parameter].append(c)
lda_model = get_lda_model(**lda_params)
coherence_model = CoherenceModel(model=lda_model, texts=texts, dictionary=lda_params["dictionary_lda"], coherence=coherence)
performance["coherence"].append(coherence_model.get_coherence())
return performance
texts = reviews['topic_words'].tolist()
lda_params.update(get_vocabulary(texts))
num_topics = [5, 10, 15, 20, 25, 30, 35, 40, 45, 50]
%time performance_num_topics = get_model_performance("num_topics", num_topics, texts, lda_params, "c_v")
alpha = [0.01, 0.05, 0.1, 0.5, 1, 1.5, 2, 2.5, 3, 3.5]
lda_params.update({"num_topics" : 35})
%time performance_alpha = get_model_performance("alpha", alpha, texts, lda_params, "c_v")
eta = [0.01, 0.03, 0.05, 0.07, 0.1]
lda_params.update({"num_topics" : 35, "alpha": 0.1})
%time performance_eta = get_model_performance("eta", eta, texts, lda_params, "c_v")
def plot_performance(x, y, plot_title, x_label, y_label):
plt.plot(x,y)
plt.scatter(x,y)
plt.title(plot_title)
plt.xlabel(x_label)
plt.ylabel(y_label)
plt.xticks(x)
# plt.show()
%matplotlib inline
fig = plt.figure(figsize=(16,4))
plt.subplot(1, 3, 1)
plot_performance(performance_num_topics["num_topics"], performance_num_topics["coherence"], 'Number of Topics vs. Coherence', 'Number of Topics', 'Coherence')
plt.subplot(1, 3, 2)
plot_performance(performance_alpha["alpha"], performance_alpha["coherence"], 'Alpha Values vs. Coherence', 'Alpha Values', 'Coherence')
plt.subplot(1, 3, 3)
plot_performance(performance_eta["eta"], performance_eta["coherence"], 'Eta Values vs. Coherence', 'Eta Values', 'Coherence')
plt.tight_layout()
plt.show()
With the alpha and eta hyperparameters learned automatically from the corpus as asymmetric priors, we see that increasing the number of topics beyond 35 results in almost constant coherence, so we set the number of topics to 35. For 35 topics, an alpha value of 0.1 yields the highest coherence. Alpha controls the topic-document density: if a document is expected to contain multiple topics, set this value high. Similarly, eta governs the word-topic density, i.e. the distribution of words over topics. Set it as low as possible to obtain good topic separability; if it is set high, different topics will contain the same words. Thomas L. Griffiths and Mark Steyvers, in Finding scientific topics, recommend setting alpha to the ratio $\frac{50}{\text{number of topics}}$ and eta to $0.1$.
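For reference, a quick computation of those heuristic values for our setting; the run below keeps the automatically learned priors, and the heuristic version is left commented out.
# Griffiths & Steyvers rule of thumb for 35 topics
heuristic_alpha = 50 / 35   # ~1.43
heuristic_eta = 0.1
print(round(heuristic_alpha, 2), heuristic_eta)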
lda_params.update(get_vocabulary(texts))
# lda_params.update({"num_topics": 35, "alpha": 0.1, "eta": 0.01})
lda_params.update({"num_topics": 35, "alpha": "auto", "eta": "auto"})
# lda_params.update({"alpha": 50/lda_params["num_topics"], "eta": 200/len(lda_params["dictionary_lda"]) })
%time lda_model = get_lda_model(**lda_params)
for i,topic in lda_model.show_topics(formatted=True, num_topics=lda_params["num_topics"], num_words=20):
print(str(i)+": "+ topic)
print()
From the topic distributions above we can see LDA binding some interrelated words together:
No. | Aspect | Tag | Description | Words |
---|---|---|---|---|
1. | Food | food | Words describing the food item and its characteristics. | Food, Meal, Delicious, Tasty, Portion, Flavor, Vegetarian, Yummy, Seafood, Spicy, Eat, Salty, Tender, Soggy, Snack, Ingredient, Quality Food, Roasted etc. |
2. | Order | order | Words describing the order placed at the restaurant. | Order, Dish, Entree, Dine, Italian, Starter, Bowl, Chef, Plate, Main Course, Appetizer, Snack, Mexican Food, Cuisine, Decided try, Specialty, Platter, Chinese Food etc. |
3. | Service | service | Words describing the service offered by the restaurant. | Service, Friendly, Service Excellent, Service Slow, Customer Service, Professionalism, Management, Service Fast, Helpful, Unfriendly, Service Prompt, Service Attentive etc. |
4. | Recommendation | recommendation | Words describing a recommendation, either given by the diner or by others. | Recommendation, Review, Return, Star, Rating, Suggest, Worth Visit, Definitely Recommend, Highly Recommended |
5. | Bar or Beverage | bar_beverage | Words referring to the bar, drinks or beverages. | Beer, Bar, Stout, Champagne, Cocktail, Happy Hour, Juice, Wine, Drink, Coffee, Tea, Glass, Milkshake, Bottle, Lemonade, Wine Selection, Alcohol, Brew, Beverage, Sampler, Refreshing etc. |
6. | Ambience | ambience | Words describing the surroundings of the restaurant. | Ambience, Decor, Warm, Busy, Crowd, Scene, Vibe, Neighbourhood, Cozy, Inviting, Welcoming, Relax, Live Music, Sunny, Theme, Loud, Environment, Lively, Interior, Music, View etc. |
7. | Place or Location | place_location | Words describing the restaurant location or nearby places. | Location, Place, Spot, Space, Street, View, Locate, Corner, Local, Market, Establishment, Section, Joint, Pub, Store, Stop, Clean, Floor |
8. | Experience | experience | Words referring to the diner's experience or feedback. | Experience, Wait, Visit, Enjoyable, Fault, Long Wait, Spend, Queue, Lost, Welcoming, Great Value, Complain, Lacking, Comfortable, Line, Special Occasion, Rush, Dining Experience etc. |
9. | Dessert | dessert | Words describing sweets or desserts. | Dessert, Sweet, Ice cream, Cheesecake, Pudding, Pastry, Sugar, Crepe, Mousse, Chocolate, Cake, Pancake, Waffle, Baked, Whipped Cream, Creamy, Pie etc. |
10. | Price | price | Words referring to monetary aspects. | Price, Cheap, Expensive, Bill, Cost, Prices Reasonable, Affordable, Overprice, Cash, Pricy, Spend, Good Value, Money, Tip, Pay, Inexpensive, Charge, Worth Price etc. |
11. | Menu or Offering | menu_offering | Words referring to the variety of food items offered at the restaurant. | Menu, Dish, Choice, Serve, Buffet, Homemade, Offer, Main Course, Tasting Menu, Option, Selection, Starter, Cuisine, Staple, Vegan, Limited, Snack, Variety |
12. | Staff or Team | staff_team | Words relating to the personnel at the restaurant. | Wait Staff, Waiter, Owner, Chef, Server, Staff, Waitress, Bartender, Hostess, Team, Host, Management, Friendly, Professional, Welcoming, Polite etc. |
13. | Facility | facility | Words describing facilities available at the restaurant. | Bar, Dining Room, Cafe, Outdoor, Kitchen, Parking, Garden, Seating, Store, Cooking, Patio, Indoor, Bathroom, Fine Dining, Roof, Terrace, Outdoor Seating, Screen etc. |
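The topic_list, topic_map and seed_words objects used later in this notebook are imported from stopwordslist.py and are not reproduced here. A minimal sketch of their assumed structure, using the tags from the table above and a few illustrative words, would look roughly like this:
# Sketch only: assumed structure of the aspect objects defined in stopwordslist.py
topic_list = ['food', 'order', 'service', 'recommendation', 'bar_beverage', 'ambience',
              'place_location', 'experience', 'dessert', 'price', 'menu_offering',
              'staff_team', 'facility']
# seed_words maps each seed word to the index of the topic it should anchor
seed_words = {'delicious': 0, 'portion': 0,      # food
              'friendly': 2, 'helpful': 2,       # service
              'decor': 5, 'cozy': 5,             # ambience
              'cheap': 9, 'expensive': 9}        # price
# topic_map similarly maps each learned topic index to one of the aspect tags above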
print(reviews.text.iloc[0])
print(reviews.lda_tokens.iloc[0])
print(reviews.topic_words.iloc[0])
print()
print(lda_model.get_document_topics(lda_params["corpus"][0]))
max(lda_model.get_document_topics(lda_params["corpus"][0]),key=lambda x: x[1])
len(lda_params["dictionary_lda"].keys())
For short texts, GSDMM, which is based on Gibbs sampling, generates better results. It requires setting an upper bound K on the maximum number of topics expected. Learning from the earlier LDA run, we will set this value greater than or equal to 13, as we can clearly expect at least 13 topics in the reviews data. Apart from that, we will set the optimal values of alpha and beta (eta in LDA) to $0.1$ and $0.01$ respectively, as learned from LDA.
# Create corpus and dictionary similar to LDA
docs = reviews.topic_words.to_numpy()
dictionary = corpora.Dictionary(docs)
vocab_length = len(dictionary)
bow_corpus = [dictionary.doc2bow(doc) for doc in docs]
# gsdmm_model = MovieGroupProcess(K=13, alpha=0.07, beta=0.07, n_iters=20)
gsdmm_model = MovieGroupProcess(K=15, alpha=0.1, beta=0.01, n_iters=20)
%time y = gsdmm_model.fit(docs, vocab_length)
doc_count = np.array(gsdmm_model.cluster_doc_count)
print(f'\033[1mNumber of documents per topic : \033[0m{doc_count}')
topic_clusters_importance = doc_count.argsort()[-15:][::-1]
print(f'\033[1mMost important clusters based on number of documents inside it: \033[0m {topic_clusters_importance}')
def top_words(cluster_word_distribution, top_cluster, values):
for cluster in top_cluster:
sort_dicts = sorted(cluster_word_distribution[cluster].items(), key=lambda k: k[1], reverse=True)[:values]
print("\nCluster %s : %s"%(cluster, sort_dicts))
# Get the top words in topics
top_words(gsdmm_model.cluster_word_distribution, topic_clusters_importance, 30)
From the results above we can see some meaningful clusters forming, such as food, bar/beverage, price, dessert, service, ambience and staff.
print(f'\033[1mTopic distribution for a document \033[0m', end="\n\n")
print(reviews.topic_words.iloc[0])
print(gsdmm_model.score(docs[0]))
# # Assign each document to most prevalent topic
# reviews["topic_gsdmm"] = y
# reviews['aspect_gsdmm'] = reviews["topic_gsdmm"].map({0:'food', 1:'bar_beverage', 2:'price', 3:'dessert', 4:'food', 5:'experience', 6:'menu_offering',
# 7:'recommendation', 8:'service', 9:'place_location', 10:'ambience', 11:'staff_team', 12:'order'})
# reviews.head(30)
# reviews_lda[["id", "text", "topic_words", "aspect", "topic_gsdmm", "aspect_gsdmm"]].to_csv("data/final_aspect.csv", index = False)
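# Semi-supervised (seeded) LDA: build an eta prior of shape (num_topics, vocabulary size),
# give every seed word a very large weight in its assigned topic, and pass the resulting
# matrix to the model as eta so that topics form around the seed vocabulary.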
eta_mat = np.full((lda_params["num_topics"], len(lda_params["dictionary_lda"])), fill_value=(1))
for word, topic in seed_words.items():
word_index = [key for key,term in lda_params["dictionary_lda"].items() if term==word]
if (len(word_index)>0):
eta_mat[topic, word_index[0]] = 1e10
# Divide each column by its total so that, for every word, the prior weights across topics sum to 1
eta_mat = np.divide(eta_mat, eta_mat.sum(axis=0))
# Use prior probability matrix as the new eta
lda_params.update({"eta" : eta_mat, "alpha": "auto"})
lda_params["eta"]
%time lda_model_prior = get_lda_model(**lda_params)
for i,topic in lda_model_prior.show_topics(formatted=True, num_topics=lda_params["num_topics"], num_words=30):
print(str(i)+": "+ topic)
print()
print(reviews.text.iloc[0])
print(reviews.topic_words.iloc[0], end="\n\n")
print(f'\033[1mTopic distribution for a document \033[0m', end="\n\n")
print(lda_model_prior.get_document_topics(lda_params["corpus"][0]))
print(f'\033[1mMost probable topic for the document \033[0m', end="\n\n")
max(lda_model_prior.get_document_topics(lda_params["corpus"][0]),key=lambda x: x[1])
# Assign each document to most prevalent topic
reviews["topic_sslda"] = [max(p,key=lambda item: item[1])[0] for p in lda_model_prior.get_document_topics(lda_params["corpus"])]
reviews['aspect_sslda'] = reviews["topic_sslda"].map(topic_map)
# Assign empty topic words with miscellaneous aspect
reviews.aspect_sslda = np.where(reviews['topic_words'].map(lambda d: len(d)) == 0, 'miscellaneous', reviews.aspect_sslda)
reviews.head()
texts = reviews['topic_words'].tolist()
coherence_model_lda = CoherenceModel(model=lda_model, texts=texts, dictionary=lda_params["dictionary_lda"], coherence="c_v")
coherence_model_lda_prior = CoherenceModel(model=lda_model_prior, texts=texts, dictionary=lda_params["dictionary_lda"], coherence="c_v")
print(coherence_model_lda.get_coherence())
print(coherence_model_lda_prior.get_coherence())
vis = pyLDAvis.gensim_models.prepare(topic_model=lda_model, corpus=lda_params["corpus"], dictionary=lda_params["dictionary_lda"])
pyLDAvis.enable_notebook()
pyLDAvis.display(vis)
vis_sslda = pyLDAvis.gensim_models.prepare(topic_model=lda_model_prior, corpus=lda_params["corpus"], dictionary=lda_params["dictionary_lda"])
pyLDAvis.enable_notebook()
pyLDAvis.display(vis_sslda)
def get_corex_model(tokens, num_topics = 30, iterations=200, anchor_words=None, anchor_strength=2):
bow_cv = CountVectorizer(binary=True)
# Matrix of shape documents x vocabulary words
doc_word_mat = bow_cv.fit_transform(tokens)
doc_word_mat = ss.csr_matrix(doc_word_mat)
# Get words in the vocabulary
vocab_words = list(np.asarray(bow_cv.get_feature_names_out()))
# Remove numeric tokens
valid_idxs = [idx for idx,word in enumerate(vocab_words) if not word.isdigit()]
doc_word_mat = doc_word_mat[:,valid_idxs]
vocab_words = [word for idx,word in enumerate(vocab_words) if not word.isdigit()]
# Train the CorEx topic model with num_topics hidden topics
corex_model = ct.Corex(n_hidden=num_topics, words=vocab_words, max_iter=iterations, verbose=False, seed=1)
# corex_model.fit(doc_word_mat, words=vocab_words)
corex_model.fit(doc_word_mat, words=vocab_words, anchors=anchor_words, anchor_strength=anchor_strength)
return corex_model
%time corex_model = get_corex_model(tokens=reviews["lda_tokens_noun"].map(lambda x: " ".join(x)))
# Print a single topic from CorEx topic model
corex_model.get_topics(topic=1, n_words=10)
# Topic words list
for n,t in enumerate(corex_model.get_topics(n_words=10)):
topic_words,_,_ = zip(*t)
print('{}: '.format(n) + ', '.join(topic_words))
print(corex_model.p_y_given_x.shape) # documents x num_topics
corex_model.p_y_given_x[0]
corex_model.labels[0]
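# Anchored (guided) CorEx: group the seed words by their topic index so that each inner
# list of anchor_words anchors one CorEx topic.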
seed_words_list = dict(sorted(seed_words.items(), key=lambda item: item[1]))
anchor_words = []
for val in set(seed_words_list.values()):
anchor_words.append([k for k,v in seed_words_list.items() if v == val])
len(anchor_words)
%time corex_model_guided = get_corex_model(tokens=reviews["topic_words"].map(lambda x: " ".join(x)), num_topics=13, anchor_words=anchor_words)
# Print a single topic from CorEx topic model
corex_model_guided.get_topics(topic=1, n_words=10)
# Topic words list
for n,t in enumerate(corex_model_guided.get_topics(n_words=10)):
topic_words,_,_ = zip(*t)
print('{}: '.format(n) + ', '.join(topic_words))
print(corex_model_guided.p_y_given_x.shape) # documents x num_topics
corex_model_guided.p_y_given_x[0]
corex_model_guided.labels[0]
# No topic assigned to the document
print(f'\033[1mReview irrelevant to any topic : \033[0m{reviews.loc[123,"text"]}')
print(f'\033[1mNot assigned to any topic? : \033[0m{(~corex_model_guided.labels[123]).all()}')
# reviews["aspect_corex"] = [topic_list[np.argmax(prob)] for prob in corex_model.p_y_given_x]
reviews["aspect_corex"] = [np.array(topic_list)[t][0] if t.any() else 'miscellaneous' for t in corex_model_guided.labels]
# Assign empty topic words with miscellaneous aspect
reviews.aspect_corex = np.where(reviews['topic_words'].map(lambda d: len(d)) == 0, 'miscellaneous', reviews.aspect_corex)
reviews.head(3)
# reviews.to_csv("data/corex_model_aspect.csv", index=False)
# Two aspects in same sentence
print(reviews.text[407])
np.array(topic_list)[corex_model_guided.labels[407]]
display(reviews[["text", "topic_words", "aspect_corex"]].loc[403:411])
reviews[["restaurant_id", "review_title", "text", "topic_words", "aspect_corex"]].loc[np.r_[191:196, 77:87, 133:139]]