Documentation for the Code

HelperUtils.py

extract_book_name_from_root_url(root_url)
check_if_file_exists_otherwise_handle(file_path)
data_for_book_exists_current_date(book_data_folder)

HelperUtils.check_if_file_exists_otherwise_handle(file_path)

Checks whether the file at file_path exists. If it does not, returns False; if it does, deletes the file and returns True.

Parameters:	file_path (str) – path of the file to check
Returns:	bool flag indicating whether the file existed
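
A minimal sketch of the behaviour described above, using only the standard library. The body is an assumption reconstructed from the description, not the project's actual implementation; the temporary file in the usage part exists only for illustration.

```python
import os
import tempfile


def check_if_file_exists_otherwise_handle(file_path):
    """Return False if file_path is absent; otherwise delete it and return True."""
    if not os.path.isfile(file_path):
        return False
    os.remove(file_path)  # side effect: the existing file is removed
    return True


# Usage: create a throwaway file, then check it twice.
fd, tmp_path = tempfile.mkstemp()
os.close(fd)
first_call = check_if_file_exists_otherwise_handle(tmp_path)   # file existed and is deleted
second_call = check_if_file_exists_otherwise_handle(tmp_path)  # file is already gone
```

Because the function deletes on success, a second call with the same path reports that the file no longer exists.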

HelperUtils.data_for_book_exists_current_date(book_data_folder)

Checks whether a folder exists and whether it was created on the current date

Parameters:	book_data_folder (str) – folder name/partial path
Returns:	bool flag indicating whether the folder exists and was created on the current date
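
One way such a check could look, as a hedged sketch. It assumes the creation date is read with os.path.getctime, which reports creation time on Windows but the last metadata change on Unix; the real function may use a different approach.

```python
import datetime
import os
import tempfile


def data_for_book_exists_current_date(book_data_folder):
    """Return True only if the folder exists and its ctime falls on today's date."""
    if not os.path.isdir(book_data_folder):
        return False
    # Assumption: getctime is a good enough proxy for the folder's creation time.
    created_on = datetime.date.fromtimestamp(os.path.getctime(book_data_folder))
    return created_on == datetime.date.today()


# Usage: a freshly created folder passes, a missing folder does not.
with tempfile.TemporaryDirectory() as tmp_dir:
    fresh = data_for_book_exists_current_date(tmp_dir)
    missing = data_for_book_exists_current_date(os.path.join(tmp_dir, "no-such-folder"))
```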

FilePicking.py

save_obj(obj, name, directory, json_save_needed)
load_obj(name, directory)
load_latest_obj(name, directory)

FileUtil.FilePicking.load_latest_obj(name, directory)

Loads the latest .pkl file from a book directory

Parameters:	name (str) – name of the .pkl file to load
	directory (str) – name of the directory where the .pkl file is saved

FileUtil.FilePicking.load_obj(name, directory)

Loads the .pkl file named name from directory

Parameters:	name (str) – name with which to load the .pkl file
	directory (str) – name of the directory where the .pkl file is saved

FileUtil.FilePicking.save_obj(obj, name, directory, json_save_needed)

Checks whether the individual book directory exists as of the current date. If it does not exist, the function creates it; otherwise it deletes the older directory and recreates it. It then pickles obj and saves it with the name name in directory. Depending on json_save_needed, it also saves the data as JSON in the same folder.

Parameters:	obj (python object) – the object being pickled
	name (str) – name with which to save the .pkl file
	directory (str) – name of the directory where the .pkl file is saved
	json_save_needed (bool) – indicates whether the data needs to be saved as JSON or not
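
The save and load-latest round trip can be sketched as follows. This is an illustration under stated assumptions, not the project's code: it assumes files are named name_YYYY-MM-DD.pkl (by analogy with the sci-fi-books-list_YYYY-MM-DD.pkl file mentioned later in this page), and it simplifies the directory handling by always recreating the directory.

```python
import datetime
import glob
import json
import os
import pickle
import shutil
import tempfile


def save_obj(obj, name, directory, json_save_needed):
    """Recreate directory, pickle obj as <name>_<date>.pkl, optionally dump JSON too."""
    if os.path.isdir(directory):
        shutil.rmtree(directory)  # simplification: drop the older directory unconditionally
    os.makedirs(directory)
    stamp = datetime.date.today().isoformat()  # assumed naming scheme
    base = os.path.join(directory, f"{name}_{stamp}")
    with open(base + ".pkl", "wb") as handle:
        pickle.dump(obj, handle)
    if json_save_needed:
        with open(base + ".json", "w") as handle:
            json.dump(obj, handle, indent=2)


def load_latest_obj(name, directory):
    """Load the newest <name>_*.pkl; ISO dates sort correctly as plain strings."""
    pattern = os.path.join(glob.escape(directory), f"{name}_*.pkl")
    candidates = sorted(glob.glob(pattern))
    if not candidates:
        raise FileNotFoundError(f"no pickle for {name!r} in {directory!r}")
    with open(candidates[-1], "rb") as handle:
        return pickle.load(handle)


# Usage with placeholder data in a temporary location.
with tempfile.TemporaryDirectory() as root:
    book_dir = os.path.join(root, "example-book")
    save_obj({0: {"likes": 3}}, "example-book", book_dir, json_save_needed=True)
    restored = load_latest_obj("example-book", book_dir)
    json_written = any(f.endswith(".json") for f in os.listdir(book_dir))
```

Note that the JSON copy stores dictionary keys as strings, so the pickle remains the faithful copy of the Python object.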

GenreScraper.py

Functions:

_create_main_parser(genre)
_build_book_details_map(sci_fi_book_details, book_index, book_details)
_retrieve_book_name(book_block, class_name)
_retrieve_book_URL_and_image_URL(book_block)
_retrieve_author_name(book_block, class_name)
_retrieve_number_of_times_shelved(book_block)
_retrieve_rating_published_details(book_block)
retriveSciFiBookList(genre)

GenreScraper._create_main_parser(genre)

Creates the bs4 parser from the Goodreads URL. The parser is used to scrape details of the most popular books in a genre

Parameters:	genre (str) – genre to scrape book details for
Returns:	parser
Return type:	bs4 object

GenreScraper._build_book_details_map(sci_fi_book_details, book_index, book_details)

Builds the dictionary containing the details of the book. This dictionary is later pickled for persistent storage

Parameters:	sci_fi_book_details (dict) – dictionary to store book details
	book_index (int) – counter variable for the dictionary sci_fi_book_details
	book_details (list) – list containing the different details about the book that need to be stored in the dict sci_fi_book_details
Returns:	book details
Return type:	dict
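
As a rough sketch of the idea, the function can be pictured as storing one record per book under an integer index. The field names and the sample values below are hypothetical placeholders; the actual keys used by the project are not documented here.

```python
def build_book_details_map(sci_fi_book_details, book_index, book_details):
    """Store book_details under book_index and return the updated dict."""
    # Hypothetical field names, chosen only for this illustration.
    field_names = ("book_name", "book_url", "author_name", "times_shelved")
    sci_fi_book_details[book_index] = dict(zip(field_names, book_details))
    return sci_fi_book_details


# Usage: accumulate two placeholder books in scrape order.
books = {}
scraped = [
    ["Book A", "https://example.com/a", "Author A", 120],
    ["Book B", "https://example.com/b", "Author B", 85],
]
for index, details in enumerate(scraped):
    books = build_book_details_map(books, index, details)
```

Keying by an integer counter keeps the scrape order recoverable after the dict is pickled and reloaded.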

GenreScraper._retrieve_book_name(book_block, class_name)

Finds the book name, which is in a link <a> tag

Parameters:	book_block (bs4) – represents the bs4 for an individual book on the webpage
	class_name (str) – the CSS class name needed to extract the book name
Returns:	name of the book
Return type:	str

GenreScraper._retrieve_book_URL_and_image_URL(book_block)

Finds the book URL (which is used in web_scraper_goodreads_root.BookReviews) and the book image URL

Parameters:	book_block (bs4) – represents the bs4 for an individual book on the webpage
Returns:	book URL and book image URL
Return type:	list

GenreScraper._retrieve_author_name(book_block, class_name)

Finds the author details of the book. Author details are:

- author name
- Goodreads author URL
- whether the author is on Goodreads or not

Parameters:	book_block (bs4) – represents the bs4 for an individual book on the webpage
	class_name (str) – the CSS class name needed to extract the book name
Returns:	author details
Return type:	list

GenreScraper._retrieve_number_of_times_shelved(book_block)

Finds the number of times a book has been shelved by people. Shelving is a way for Goodreads users to categorize a book

Parameters:	book_block (bs4) – represents the bs4 for an individual book on the webpage
Returns:	number of times shelved
Return type:	int

GenreScraper._retrieve_rating_published_details(book_block)

Finds the rating details of the book. Rating details include:

- average rating
- number of ratings
- year the book was published

Parameters:	book_block (bs4) – represents the bs4 for an individual book on the webpage
Returns:	rating details
Return type:	list

GenreScraper.retriveSciFiBookList(genre)

Main entry point into GenreScraper. This function does the following:

Creates the bs4 parser via _create_main_parser(genre)

Gets a list of bs4 objects, one for each of the books

Loops through each book, retrieves the book details and stores them in the dict sci_fi_book_details

Parameters:	genre (str) – genre to scrape details about
Returns:	book details from a particular genre
Return type:	dict

MainBookScraper.py

Functions:

generate_book_review_images(genre): Scrapes goodreads.com for reviews and visualizes the data

MainBookScraper.generate_book_review_images(genre)

Does the following:

Scrapes goodreads.com to get the list of most popular books and their details for the input genre

Saves the details to a pickle file

Loads the latest pickle file data

Loops through the book list to:

	extract the book name from the book URL

	scrape the book review details

	save the details to a pickle file

	load the latest pickle file data

	visualize the review likes data

Parameters:	genre (str) – book genre to process

Note

When a timeout exception occurs during scraping, generate_book_review_images retries up to FAILURE_THRESHOLD times (defined in web_scraper_goodreads_root.CommonConstants.Constants) before skipping the book
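
The retry-then-skip behaviour in the note can be sketched as below. Everything here is illustrative: the value of FAILURE_THRESHOLD is assumed, scrape_with_retries is a hypothetical helper name, and the built-in TimeoutError stands in for whatever timeout exception the real scraper catches (selenium raises its own TimeoutException class).

```python
FAILURE_THRESHOLD = 3  # assumed value; the real one lives in CommonConstants.Constants


def scrape_with_retries(scrape, book_url, failure_threshold=FAILURE_THRESHOLD):
    """Call scrape(book_url); retry on timeout, return None once the threshold is hit."""
    for attempt in range(1, failure_threshold + 1):
        try:
            return scrape(book_url)
        except TimeoutError:
            print(f"timeout on attempt {attempt} of {failure_threshold} for {book_url}")
    return None  # caller treats None as "skip this book"


# Usage with two stand-in scrapers: one recovers on the third try, one never does.
attempts = []


def flaky_scrape(book_url):
    attempts.append(book_url)
    if len(attempts) < 3:
        raise TimeoutError("simulated timeout")
    return {"reviews": []}


def always_timing_out(book_url):
    raise TimeoutError("simulated timeout")


recovered = scrape_with_retries(flaky_scrape, "book-a")
skipped = scrape_with_retries(always_timing_out, "book-b")
```

Returning None instead of raising lets the outer loop over the book list continue with the next book.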

SiteNavigator.py

Functions:

_init(root_url)
get_html_code_for_first_page(root_url, new_book)
get_html_code_for_other_pages(root_url)

SiteNavigator._init(root_url)

Creates and initializes the selenium object used for traversal. It runs via chromedriver.exe, so chromedriver.exe must be downloaded and available on the path for this code to work. The selenium object is made headless via chrome_options

Parameters:	root_url (str) – used to initialize selenium object
Returns:	selenium instance

SiteNavigator.get_html_code_for_first_page(root_url, new_book)

Visits the first page for a book and returns the HTML code of the book review section. The new_book indicator ensures that the selenium driver instance is recreated for every book

Parameters:	root_url (str) – used to initialize selenium object
	new_book (bool) – indicates whether it is a new book or not
Returns:	HTML code of the book reviews

SiteNavigator.get_html_code_for_other_pages(root_url)

Visits the other review pages and returns the HTML code. Clicking on a review page link at the bottom performs an Ajax call, which returns an Element.update() that in turn updates the “reviews” id with HTML code

To check whether the review data has loaded, a dummy id is added to the HTML and the function waits until that id is no longer present after the Ajax call

Parameters:	root_url (str) – used to initialize selenium object
Returns:	HTML code of the book reviews
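
The "wait until the dummy id disappears" check can be illustrated with a generic polling loop. This sketch deliberately avoids the selenium API: get_html is a hypothetical callable returning the current page source, DUMMY_ID is an assumed marker name, and FakePage only simulates a page whose reviews are replaced after a couple of polls.

```python
import time

DUMMY_ID = "dummy-marker"  # hypothetical id injected before the Ajax call


def wait_until_marker_gone(get_html, marker=DUMMY_ID, timeout=2.0, poll_interval=0.01):
    """Poll get_html() until marker is absent, then return that HTML."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        html = get_html()
        if marker not in html:
            return html  # the Ajax update has replaced the old content
        time.sleep(poll_interval)
    raise TimeoutError(f"{marker!r} still present after {timeout} seconds")


class FakePage:
    """Stand-in for a browser page: the reviews block is swapped on the third poll."""

    def __init__(self):
        self.polls = 0

    def source(self):
        self.polls += 1
        if self.polls < 3:
            return '<div id="reviews"><span id="dummy-marker"></span>old</div>'
        return '<div id="reviews">new</div>'


page = FakePage()
fresh_html = wait_until_marker_gone(page.source)
```

The marker works because the Ajax response overwrites the whole "reviews" element, so its disappearance signals that the new reviews have arrived.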

BookReviews.py

_create_book_review_scraper_from_source(html_source)
_retrieve_review_rating(book_review_tag)
_retrieve_review_likes(first_page_book_review_tag)
_retrieve_review_date(first_page_book_review_tag)
_build_review_rating_map(book_review_details, book_review_index, key, value)
_retrieve_book_review_details_per_page(book_review_details, root_book_review_tags, book_review_index)
retrieve_book_review_details(book_url, new_book)

BookReviews._create_book_review_scraper_from_source(html_source)

Creates the bs4 parser from the HTML source

Parameters:	html_source (str) – HTML source of the book
Returns:	bs4 parser

BookReviews._retrieve_review_rating(book_review_tag)

Retrieves the rating given by the review and maps it to an integer value from web_scraper_goodreads_root.CommonConstants.Constants

Parameters:	book_review_tag (bs4) – represents the bs4 instance for a particular review
Returns:	review rating

BookReviews._retrieve_review_likes(first_page_book_review_tag)

Retrieves the number of likes on the review

Parameters:	first_page_book_review_tag (bs4) – represents the bs4 instance for a particular review
Returns:	review likes

BookReviews._retrieve_review_date(first_page_book_review_tag)

Retrieves the review date

Parameters:	first_page_book_review_tag (bs4) – represents the bs4 instance for a particular review
Returns:	date when the review was posted

BookReviews._build_review_rating_map(book_review_details, book_review_index, key, value)

Builds the dict book_review_details

Parameters:	book_review_details (dict) – the book review details
	book_review_index (int) – counter variable for the books
	key (str) – key used in dict book_review_details
	value (str) – value used in dict book_review_details
Returns:	details of the reviews for a book

BookReviews._retrieve_book_review_details_per_page(book_review_details, root_book_review_tags, book_review_index)

This function is the main driver for all other functions in this file. It retrieves the review details, which include:

- rating of the book given by the review
- likes on the review
- date of the review

Parameters:	book_review_details (dict) – the book review details
	root_book_review_tags (bs4) – bs4 instance for a particular review
	book_review_index (int) – counter variable for the books
Returns:	details of the reviews and which book number was processed

BookReviews.retrieve_book_review_details(book_url, new_book)

Main entry function into this file’s code. It also handles the progress bar. The function scrapes review data from the first page, then visits each of the remaining review pages and scrapes review data from them

Parameters:	book_url (str) – URL of the book
	new_book (bool) – indicates whether it is a new book or not
Returns:	review details of the book

book_review_visualization.py

_build_color_scatter_plot_array(book_review)
visualize_and_save_review_information(book_review, book_name)

book_review_visualization._build_color_scatter_plot_array(book_review)

Creates the review color list based on the review rating

Parameters:	book_review (dict) – details of the book review
Returns:	list of colors based on the review

book_review_visualization.visualize_and_save_review_information(book_review, book_name)

Main function which visualizes the review likes. Each review is represented as a circle; the more likes a review has, the larger the circle. The color of the circle is based on how positive the review is. Review likes are normalized via min-max normalization. The function also saves a high-resolution PNG image of the visualization

Parameters:	book_review (dict) – details of the book review
	book_name (str) – the name of the book whose review details are to be visualized
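
Min-max normalization rescales each like count to (x - min) / (max - min), so the least-liked review maps to 0 and the most-liked to 1. The sketch below shows that step and one possible way to turn the result into circle sizes; marker_sizes and its size bounds are hypothetical and not taken from the project.

```python
def min_max_normalize(values):
    """Rescale values linearly so the minimum becomes 0.0 and the maximum 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        return [0.0 for _ in values]  # avoid dividing by zero when all values match
    span = highest - lowest
    return [(value - lowest) / span for value in values]


def marker_sizes(review_likes, min_size=20.0, max_size=2000.0):
    """Map normalized likes onto an assumed range of scatter-plot marker areas."""
    return [
        min_size + fraction * (max_size - min_size)
        for fraction in min_max_normalize(review_likes)
    ]


# Usage with placeholder like counts.
likes = [0, 5, 10]
normalized = min_max_normalize(likes)
sizes = marker_sizes(likes)
```

Normalizing first keeps one heavily liked review from forcing every other circle down to an invisible size.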

review_rating_calculation.py

Note

Reference: https://www.analyticsvidhya.com/blog/2019/07/introduction-online-rating-systems-bayesian-adjusted-rating/

_extract_review_likes_ratings(book_review)
_calculate_simple_avg_review_rating(review_ratings)
_convert_to_bayesian_adj_rating(review_likes, review_ratings)
_calculate_bayesian_adj_rating(bayesian_adj_ratings)
_build_ratings_list(processed_book_review_info)
quicksort(arr_to_be_sorted, start, end)
_process_reviews()

review_rating_calculation._extract_review_likes_ratings(book_review)

Extracts the likes and ratings from the dict (pkl file)

Parameters:	book_review (dict) – details of a book
Returns:	likes and ratings
Return type:	2 lists

review_rating_calculation._calculate_simple_avg_review_rating(review_ratings)

Calculates the average rating

Parameters:	review_ratings (list) – ratings from the reviews
Returns:	average rating
Return type:	float

review_rating_calculation._convert_to_bayesian_adj_rating(review_likes, review_ratings)

Calculates the Bayesian adjusted ratings from the Goodreads ratings and the likes on those ratings

Parameters:	review_likes (list) – likes from the reviews
	review_ratings (list) – ratings from the reviews
Returns:	Bayesian adjusted ratings
Return type:	list

review_rating_calculation._calculate_bayesian_adj_rating(bayesian_adj_ratings)

Calculates the average Bayesian adjusted rating

Parameters:	bayesian_adj_ratings (list) – Bayesian adjusted ratings from the reviews
Returns:	average Bayesian adjusted rating
Return type:	float
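
The exact formula used by these functions is not stated in this page. As an illustration only, the sketch below uses one common form of a Bayesian adjusted rating, weight * rating + (1 - weight) * prior_mean with weight = votes / (votes + prior_weight), treating the likes on a review as its votes. The prior_weight default and the choice of the simple average as the prior mean are assumptions.

```python
def convert_to_bayesian_adj_rating(review_likes, review_ratings, prior_weight=5.0):
    """Shrink each rating toward the mean rating; more likes means less shrinkage."""
    prior_mean = sum(review_ratings) / len(review_ratings)
    adjusted = []
    for likes, rating in zip(review_likes, review_ratings):
        votes = max(likes, 1)  # reviews with 0 likes count as 1 so they are not ignored
        weight = votes / (votes + prior_weight)
        adjusted.append(weight * rating + (1 - weight) * prior_mean)
    return adjusted


def calculate_bayesian_adj_rating(bayesian_adj_ratings):
    """Average the adjusted ratings into a single score."""
    return sum(bayesian_adj_ratings) / len(bayesian_adj_ratings)


# Usage with placeholder data: an unliked 1-star review and a well-liked 5-star review.
adjusted = convert_to_bayesian_adj_rating([0, 45], [1, 5])
overall = calculate_bayesian_adj_rating(adjusted)
```

With these inputs the unliked 1-star review is pulled most of the way toward the mean of 3, while the 5-star review with 45 likes stays close to 5, so the overall score ends up above the simple average.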

review_rating_calculation._build_ratings_list(processed_book_review_info)

Builds a list of rating details from a Python dict

Parameters:	processed_book_review_info (dict) – processed review details
Returns:	processed review details but as a list
Return type:	list

review_rating_calculation.quicksort(arr_to_be_sorted, start, end)

Quicksort implementation to sort a list in ascending order. Complexity: O(n log n) on average, O(n^2) in the worst case

Parameters:	arr_to_be_sorted (list) – the input list to be sorted
	start (int) – the start index
	end (int) – the end index
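
For reference, an in-place quicksort with this signature might look like the sketch below. It assumes end is an inclusive index and uses the Lomuto partition scheme with the last element as pivot; the project's implementation may partition differently.

```python
def quicksort(arr_to_be_sorted, start, end):
    """Sort arr_to_be_sorted[start..end] (inclusive) in ascending order, in place."""
    if start >= end:
        return  # zero or one element: already sorted
    pivot = arr_to_be_sorted[end]
    boundary = start  # next slot for an element that is <= pivot
    for i in range(start, end):
        if arr_to_be_sorted[i] <= pivot:
            arr_to_be_sorted[i], arr_to_be_sorted[boundary] = (
                arr_to_be_sorted[boundary],
                arr_to_be_sorted[i],
            )
            boundary += 1
    # Put the pivot between the smaller and larger partitions.
    arr_to_be_sorted[boundary], arr_to_be_sorted[end] = (
        arr_to_be_sorted[end],
        arr_to_be_sorted[boundary],
    )
    quicksort(arr_to_be_sorted, start, boundary - 1)
    quicksort(arr_to_be_sorted, boundary + 1, end)


# Usage with placeholder ratings.
ratings = [3.5, 1.0, 4.2, 2.8, 4.2]
quicksort(ratings, 0, len(ratings) - 1)
```

Because the sort happens in place, the function returns nothing and the caller reads the result from the original list.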

review_rating_calculation._process_reviews()

Main code to start processing the review details. Ensure that a sci-fi-books-list_YYYY-MM-DD.pkl file for the current date is present; otherwise run MainBookScraper to get it. Review likes that are 0 are incremented to 1 so that those reviews are not ignored completely
