__author__ = "Amoli Rajgor"
__email__ = "amoli.rajgor@gmail.com"
__website__ = "amolir.github.io"
ℹ️ Dependencies
➤ numpy ≥ 1.22.3
➤ pandas ≥ 1.4.1
➤ scikit-learn ≥ 1.0.2
➤ keybert ≥ 0.5.1
➤ nltk ≥ 3.5
➤ matplotlib ≥ 3.5.1
➤ altair ≥ 4.2.0
➤ dask ≥ 2022.4.1
Run pip install -r requirements.txt from the terminal to install all the dependencies before running the notebooks. Create a data folder and place the downloaded .csv file inside it. The intermediate results (preprocessed.csv and keywords.csv) will be stored in the data folder itself. Run the notebooks in the order eda.ipynb -> feature_engineering.ipynb -> model.ipynb to generate the results.
# Data Manipulation
import pandas as pd
import numpy as np
# RegEx and String Manipulation
import re
import string
# Language Detection
from nltk.classify.textcat import TextCat
# Multiprocessing
import dask.dataframe as dd
import multiprocessing
# BERT-Embeddings
from keybert import KeyBERT
# TF-IDF Vectorization
from sklearn.feature_extraction.text import TfidfVectorizer
# Plotting Heatmap of TF-IDF vectors
import matplotlib.pyplot as plt
import altair as alt
# Cosine Similarity
from sklearn.metrics.pairwise import cosine_similarity
# Minimum number of words that should be present in a description (value starting from 1)
min_description_word_count = 3
books_data = pd.read_csv("data/goodreads_book.csv", usecols=['Id', 'Name', 'Authors', 'ISBN', 'PublishYear', 'Publisher', 'Language', 'Description'])
display(books_data.shape)
books_data.head(5)
Notebook: eda.ipynb
Text cleaning in NLP is the process of transforming textual data into a machine-readable format. Cleaning the data is required to reduce the complexity of the model and increase its accuracy. We want to avoid processing irrelevant words and want the model to give equal weight to the same word regardless of punctuation, letter case, etc. Let's apply the following steps to clean the various features before performing keyword extraction on them. The data has already been processed using these steps (check eda.ipynb) and the results are stored in data/preprocessed.csv.
The description of a book becomes the content for the recommendation engine. Since we extract keywords from the description, books with no description won't add value to the model and are dropped here. However, to avoid information loss in a real scenario, missing descriptions could instead be filled with empty strings.
books_data.dropna(subset=["Description"], inplace=True)
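For this project the rows are dropped, but as a hedged alternative sketch, the missing descriptions could be kept as empty strings instead:
# Alternative (not used here): retain rows with missing descriptions as empty strings
# books_data["Description"] = books_data["Description"].fillna("")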
The Description feature contains URLs, HTML tags and punctuation (example below). Let's remove all this irrelevant textual information to refine it.
list(books_data.Description[books_data.Id == 1099555]) #Description with url and html tag
url_pattern = re.compile(r'https?://\S+|www\.\S+')
def remove_url(text):
return re.sub(url_pattern, r'', text)
html_pattern = re.compile('<[^>]*>')
def clean_html_tags(text):
return re.sub(html_pattern, r'', text)
punctuations = string.punctuation
def remove_punctuations(text):
return text.translate(str.maketrans('', '', punctuations))
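As a quick illustration on a made-up string (not from the dataset), the three helpers can be chained to strip the noise:
# Made-up example showing the effect of the three cleaning helpers
sample = "<i>Read more</i> at https://example.com, today!"
print(remove_punctuations(clean_html_tags(remove_url(sample))))
# -> 'Read more at  today'  (a doubled space is left where the URL was)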
books_data.Description = books_data.Description.apply(remove_url)
books_data.Description = books_data.Description.apply(clean_html_tags)
books_data.Description = books_data.Description.apply(remove_punctuations)
# Result
list(books_data.Description[books_data.Id == 1099555])
Missing Publisher values are filled with the placeholder unknown to retain these missing values during string transformation.
books_data[["Publisher"]] = books_data[["Publisher"]].fillna("unknown")
books_data[["Name", "Authors", "Publisher", "Description"]] = pd.concat([books_data[col].astype(str).str.lower().str.strip()
for col in ["Name", "Authors", "Publisher", "Description"]],
axis=1)
After removing extra spaces, it was found that some book descriptions contained only blank spaces or were really short, containing just one or two words. Such descriptions do not retain the semantic meaning of the book, so I will remove books with very short descriptions (word count less than four). We will first convert empty descriptions to NaN and then remove them.
# Find description word count
books_data["length"] = [len(d.split()) for d in books_data['Description'].tolist()]
print(set(books_data.Description[books_data.length.isin(range(0,4))]))
# Replace empty strings of description with NaN
books_data.Description = books_data.Description.replace(r'^\s*$', np.nan, regex=True)
books_data[books_data.length.isin(range(1,min_description_word_count+1))][["Id", "Name", "Description", "length"]]\
.sort_values(by=["length"], ascending=True).head(5)
books_data.dropna(subset=["Description"], inplace=True)
# Drop records with very short description
books_data.drop(books_data.index[books_data.length.isin(range(0,min_description_word_count+1))], inplace = True)
del books_data["length"]
# Convert unknown to NaN
books_data["Publisher"] = books_data.Publisher.replace('unknown',np.nan)
books_data = books_data.sort_values(by="Publisher", na_position='last')\
.drop_duplicates(subset=["Name", "Authors", "Description"], keep='first')
Although we have deleted rows with the same Name, Authors and Description, we still find books with duplicated Descriptions. This happens because of minor textual changes in the Name of the book and because certain Descriptions are repeated across different books.
series_pattern = "(?:[;]\s*|\(\s*)([^\(;]*\s*#\s*\d+(?:\.?\d+|\\&\d+|-?\d*))"
def get_book_series_info(text):
series_info = re.findall(series_pattern, text)
if series_info:
series_info = " ".join([i.replace(" ", "_") for i in series_info])
return series_info
else:
return np.nan
books_data['BookSeriesInfo'] = books_data.Name.apply(get_book_series_info)
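As a quick sanity check on book names used later in this notebook, the pattern captures the parenthesised series part (and returns NaN when there is none):
# Series part is captured, with spaces replaced by underscores
print(get_book_series_info("the majolica murders (antique lover, #5)"))              # -> 'antique_lover,_#5'
# No numbered series information present, so NaN is returned
print(get_book_series_info("the eastland disaster (images of america: illinois)"))   # -> nan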
One known limitation: for a name like [Ranma ½ (US 2nd), #28], the pattern does not capture the full series information; instead, it extracts [US 2nd), #28].
series_remove_pattern = re.compile("(?:[\(]\s*[^\(;]*\s*#\s*\d+(?:\.?\d+|\\&\d+|-?\d*)(?:;|\))|\s*[^\(;]*\s*#\s*\d+(?:\.?\d+|\\&\d+|-?\d*)\))")
def remove_series_info(text):
return re.sub(series_remove_pattern, r'', text)
books_data["Title"]= books_data["Name"].str.replace(series_remove_pattern, r'').str.strip()
The Language feature has missing values. I will impute the missing Language information using the language of the book Name. Language detection for thousands of records takes considerable time, so I have already saved the results into data/preprocessed.csv and will use that directly for further processing.
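Note: TextCat relies on NLTK corpus data. If it is not already available locally, it may need to be downloaded first; a minimal sketch, assuming a standard NLTK setup:
# TextCat uses the crubadan language profiles and NLTK's tokenizer (punkt);
# download them once if they are not already present
import nltk
nltk.download('crubadan')
nltk.download('punkt')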
tc = TextCat()
def detect_language(text):
text = " ".join(text.split()[:5])
if text.isnumeric():
return 'eng'
else:
return tc.guess_language(text).strip()
"""
Takes a long time to process thousands of records, hence the results are pre-saved in preprocessed.csv
"""
# ddf = dd.from_pandas(books_data, npartitions=4*multiprocessing.cpu_count())
# books_data["Language"] = ddf.map_partitions(lambda df: df.apply(lambda x: detect_language(x['Name']) if pd.isna(x['Language']) else x['Language'], axis=1)).compute()
# books_data.isna().sum()
temp_preview = books_data.head(5).copy()
ddf = dd.from_pandas(temp_preview, npartitions=4*multiprocessing.cpu_count())
temp_preview["Language"] = ddf.map_partitions(lambda df: df.apply(lambda x: detect_language(x['Name']) if pd.isna(x['Language']) else x['Language'], axis=1)).compute()
temp_preview
books_data["Publisher"] = books_data["Publisher"].str.replace('"','')
Replace spaces in Authors and Publisher with underscores (_) so that two authors with the same first or last name are not considered the same when the tokenization happens.
books_data["Authors"] = books_data["Authors"].str.strip().str.replace(' ', '_')
books_data["Publisher"] = books_data["Publisher"].str.strip().str.replace(' ', '_')
books_data.head(5)
Combine all the book information related tokens such as book series information, authors, publisher, language, publish year into a single summary column.
books_data["bow"] = eda_data[["BookSeriesInfo", 'Authors', 'Publisher', 'Language']].fillna('').agg(' '.join, axis=1)
books_data.bow.iloc[8375]
# Save cleaned data
# books_data.to_csv("data/preprocessed.csv", sep=",", index=False)
Notebook: feature_engineering.ipynb
The extracted keywords are stored in data/keywords.csv.
# Fetch preprocessed cleaned data
fe_data = pd.read_csv("data/preprocessed.csv", usecols=["Id", "Name", "Language", "Description", "bow"])
fe_data.head()
By default, KeyBERT uses the all-MiniLM-L6-v2 sentence-transformer model from HuggingFace 🤗 Transformers, but depending on the need, different pretrained models can be selected. For keyword extraction it uses a Bag-Of-Words technique.
kw_model = KeyBERT()
def get_keywords(text):
keywords = kw_model.extract_keywords(text, keyphrase_ngram_range=(1, 1), stop_words="english")
keywords = " ".join([k[0] for k in keywords])
return keywords
fe_data["keywords"] = fe_data.Description.apply(get_keywords)
fe_data.keywords.head()
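As an illustration on a made-up description (the exact keywords and their order depend on the underlying model):
# Hypothetical description, for illustration only
print(get_keywords("a retired detective returns to victorian london to solve one final murder"))
# e.g. something like: 'detective murder victorian london retired'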
fe_data["keywords"] = fe_data[['bow', 'keywords']].fillna('').agg(' '.join, axis=1)
fe_data.drop(['bow', 'Description'], axis = 1, inplace=True)
fe_data = fe_data.drop_duplicates(subset=["Name"], keep='first')
# Store results
# fe_data.to_csv("data/keywords.csv", sep=",", index=False)
# Fetch keywords data
model_data = pd.read_csv("data/keywords.csv")
model_data.head()
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer(analyzer = 'word',
min_df=3,
max_df = 0.6,
stop_words="english",
encoding = 'utf-8',
token_pattern=r"(?u)\S\S+")
tfidf_encoding = tfidf.fit_transform(model_data["keywords"])
# Preview the first 100 words in the vocabulary
print(tfidf.get_feature_names_out()[:100])
tfidf_df = pd.DataFrame(tfidf_encoding.toarray(), index=model_data["Name"], columns=tfidf.get_feature_names_out())
# Find top 50 books with maximum tf-idf total score
tfidf_df["total"]= tfidf_df.sum(axis=1)
tfidf_df = tfidf_df.sort_values("total", ascending=False)
del tfidf_df["total"]
# Skip the first columns containing year terms and pick 50 high-scoring books for the preview
tfidf_df_preview = tfidf_df.iloc[100:150,25:].copy()
tfidf_df_preview = tfidf_df_preview.stack().reset_index()
tfidf_df_preview = tfidf_df_preview.rename(columns={0:'tfidf', 'Name': 'book','level_1': 'term'})
tfidf_df_preview = tfidf_df_preview.sort_values(by=['book','tfidf'], ascending=[True,False]).groupby(['book']).head(10)
display(tfidf_df_preview)
def process_word_matrix(word_vec):
# Remove underscores in terms
word_vec.term = word_vec.term.str.replace('_',' ')
# Remove terms with zero tfidf score
word_vec = word_vec[word_vec.tfidf > 0]
return word_vec
tfidf_vec = process_word_matrix(tfidf_df_preview.copy())
tfidf_vec.iloc[0:5]
import altair as alt
grid = alt.Chart(tfidf_vec).encode(
x = 'rank:O',
y = 'book:N'
).transform_window(
rank = "dense_rank()",
sort = [alt.SortField("tfidf", order="descending")],
groupby = ["book"],
)
heatmap = grid.mark_rect(size=5).encode(
alt.Color('tfidf:Q', scale=alt.Scale(scheme='redpurple'))
)
text = grid.mark_text(align='center', baseline='middle', lineBreak='').encode(
text = 'term:N',
color = alt.condition(alt.datum.tfidf >= 0.23, alt.value('white'), alt.value('black'))
)
(heatmap+text).properties(width = 800)
Once the numerical vector representation of the data is created for each book, it becomes possible to apply all the techniques applicable in a geometric space. In particular, we can measure the similarity between two vectors (and thereby between the books they represent). Cosine similarity measures whether two vectors in a multidimensional space point in the same direction: it is the cosine of the angle between the two vectors, so the smaller the angle, the higher the value. If the vectors are perpendicular, the cosine similarity is zero, signifying dissimilarity between the vectors.
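As a tiny worked example with hypothetical three-dimensional vectors, cosine similarity is the dot product divided by the product of the vector norms:
# cos(theta) = a.b / (||a|| * ||b||); hypothetical vectors for illustration
a = np.array([1.0, 0.0, 1.0])
b = np.array([1.0, 1.0, 0.0])
print(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))   # 0.5
print(cosine_similarity([a], [b]))                              # [[0.5]]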
book_cosine_sim = cosine_similarity(tfidf_encoding, tfidf_encoding)
Each row and column in the similarity matrix represents a book, and each entry contains the cosine similarity between the corresponding pair of books. Diagonal values are 1 ($\cos(0)$) because a book has absolute similarity with itself. The similarity matrix is sparse, so we will use matplotlib's spy to visualise the non-zero elements. All the blue markers in the image below are non-zero values. Books at indices roughly in the 1800-2000 range are similar to very few other books.
# Preview Similarity Matrix
book_cosine_sim
# Visualize similarity between books
plt.figure(figsize=(6, 6), dpi=80)
plt.spy(book_cosine_sim, precision = 0.1, markersize = 0.04)
plt.tight_layout()
plt.show()
The purpose: given a book name, find the top n similar books based on the cosine similarity score. In real use cases, the input book could be a book the user has read, has rated highly or has added to their read-later list.
Books are recommended using the information captured in the keywords: series information, authors, publisher, language and the description keywords.
books = pd.Series(model_data['Name'])
def recommend_books_similar_to(book_name, n=5, cosine_sim_mat=book_cosine_sim):
    # Get index of the input book
    input_idx = books[books == book_name].index[0]
    # Find top n similar books in decreasing order of similarity score
    top_n_books_idx = list(pd.Series(cosine_sim_mat[input_idx]).sort_values(ascending=False).iloc[1:n+1].index)
    # [1:n+1] to exclude position 0 (the input book itself)
    recommended_books = [books[i] for i in top_n_books_idx]
    return recommended_books
# Recommendations with series information
print("\033[1m{}\033[0m".format("Recommendation (Series Information) based on the read: The Eastland Disaster (Images of America: Illinois)"))
display(recommend_books_similar_to("the eastland disaster (images of america: illinois)", 5))
# Recommendations with series information numbered
print("\n\033[1m{}\033[0m".format("Recommendation (Numbered Series) based on the read: The Majolica Murders (Antique Lover, #5)"))
display(recommend_books_similar_to("the majolica murders (antique lover, #5)", 5))
print("\n\033[1m{}\033[0m".format("Recommendation (Theme: Programming) based on the read: The Practice of Programming (Addison-Wesley Professional Computing Series)"))
display(recommend_books_similar_to('the practice of programming (addison-wesley professional computing series)', 5))
print("\n\033[1m{}\033[0m".format("Recommendation (Author: Dean Koontz) based on the read: Cold Fire"))
display(recommend_books_similar_to("cold fire",5))