Introduction
In today’s challenging job market, individuals must gather reliable information to make informed career decisions. Glassdoor is a popular platform where employees anonymously share their experiences. However, the abundance of reviews can overwhelm job seekers. To address this, we build an NLP-driven system that automatically condenses Glassdoor reviews into short summaries. Our project walks through the process step by step, from using Selenium for review collection to leveraging NLTK for summarization. These concise summaries provide insight into company culture and growth opportunities, helping individuals match their career aspirations with suitable organizations. We also discuss limitations, such as interpretation differences and data collection errors, to give a complete picture of the summarization process.
Learning Objectives
The learning objectives of this project encompass developing a robust text summarization system that effectively condenses voluminous Glassdoor reviews into concise and informative summaries. By undertaking this project, you will:
- Understand how to summarize reviews from public platforms, in this case, Glassdoor, and how it can immensely benefit individuals seeking to evaluate an organization before accepting a job offer. Recognize the challenges posed by the vast amount of textual data available and the need for automated summarization techniques.
- Learn the fundamentals of web scraping and utilize the Selenium library in Python to extract Glassdoor reviews. Explore navigating web pages, interacting with elements, and retrieving textual data for further analysis.
- Develop skills in cleaning and preparing textual data extracted from Glassdoor reviews. Implement methods to handle noise, remove irrelevant information, and ensure the quality of the input data for effective summarization.
- Utilize the NLTK (Natural Language Toolkit) library in Python to leverage a wide range of NLP functionalities for text processing, tokenization, sentence segmentation, and more. Gain hands-on experience in using these tools to facilitate the text summarization process.
This article was published as a part of the Data Science Blogathon.
Project Description
This project reduces the effort of reading a large volume of Glassdoor reviews by developing an automated text summarization system. Using natural language processing (NLP) techniques, the system extracts the most pertinent information from the reviews and generates compact, informative summaries. The project covers data collection from Glassdoor with Selenium, data preprocessing, and text summarization, so that individuals can quickly grasp the main insights about an organization’s culture and work environment.
Problem Statement
This project aims to assist people in interpreting an organization’s culture and work environment based on numerous Glassdoor reviews. Glassdoor, a highly used platform, has become a primary resource for individuals to gather insights about potential employers. However, the vast number of reviews on Glassdoor can be daunting, posing difficulties for individuals to distill useful insights effectively.
An organization’s culture, leadership style, work-life balance, advancement prospects, and overall employee happiness are key considerations that can significantly sway a person’s career decisions. But navigating numerous reviews, each differing in length, style, and focus, is challenging, and the lack of a concise, easy-to-understand summary makes the problem worse.
The task at hand, therefore, is to devise a system for summarizing text that can efficiently process the myriad of Glassdoor reviews and deliver succinct yet informative summaries. By automating this process, we aim to provide individuals with an exhaustive overview of a company’s characteristics in a user-friendly manner. The system will enable job hunters to quickly grasp key themes and sentiments from the reviews, facilitating a smoother decision-making process regarding job opportunities.
In resolving this problem, we aim to alleviate the information saturation faced by job seekers and empower them to make informed decisions that align with their career goals. The text summarization system developed through this project will be an invaluable resource for individuals seeking to understand an organization’s work climate and culture, providing them the confidence to navigate the employment landscape.
Approach
We aim to streamline the understanding of a company’s work culture and environment through Glassdoor reviews. Our strategy involves a systematic process encompassing data collection, preparation, and text summarization.
- Data Collection: We will utilize the Selenium library for scraping Glassdoor reviews. This will enable us to accumulate many reviews for the targeted company. Automating this process ensures the collection of a diverse set of reviews, offering a comprehensive range of experiences and viewpoints.
- Data Preparation: Once the reviews are collected, we will undertake data preprocessing to ensure the extracted text’s quality and relevance. This includes removing irrelevant data, addressing unusual characters or formatting inconsistencies, and segmenting the text into smaller components like sentences or words.
- Text Summarization: In the text summarization phase, we will employ natural language processing (NLP) techniques, namely keyword frequency analysis, lexicon-based sentiment scoring, and heuristic sentence scoring, to generate brief summaries from the preprocessed review data.
Scenario
Imagine the case of Alex, a proficient software engineer who has been offered a position at Salesforce, a renowned tech firm. Alex wants to delve deeper into Salesforce’s work culture, environment, and employee satisfaction as part of their decision-making process.
With our method of condensing Glassdoor reviews, Alex can swiftly access the main points from many Salesforce-specific employee reviews. By leveraging the automated text summarization system we’ve created, Alex can obtain concise summaries that highlight key elements such as the firm’s team-oriented work culture, advancement opportunities, and overall employee contentment.
By reviewing these summaries, Alex can thoroughly understand Salesforce’s corporate characteristics without spending too much time reading the reviews. These summaries provide a compact yet insightful perspective, enabling Alex to make a decision that aligns with their career goals.
Data Collection & Preparation
We will use the Selenium library in Python to collect reviews from Glassdoor. The code snippets below walk through the process, and the steps involved are outlined one by one:
Importing Libraries
We begin by importing the necessary libraries, including Selenium, Pandas, and other essential modules, ensuring a comprehensive environment for data collection.
# Importing the necessary libraries
import selenium
from selenium import webdriver as wb
import pandas as pd
import time
from time import sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import itertools
Setting Up Chrome Driver
We establish the setup for the ChromeDriver by specifying the appropriate path where it is stored, thus allowing seamless integration with the Selenium framework.
# Changing the working directory to the path
# where the chromedriver is saved & setting
# up the chrome driver
# (%cd is an IPython/Jupyter magic command)
from selenium.webdriver.chrome.service import Service
%cd "PATH WHERE CHROMEDRIVER IS SAVED"
# Selenium 4 receives the driver path through a Service object
driver = wb.Chrome(service=Service(r"YOUR PATH\chromedriver.exe"))
driver.get('https://www.glassdoor.co.in/Reviews/Salesforce-Reviews-E11159.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=PART_TIME&filter.employmentStatus=REGULAR')
Accessing the Glassdoor Page
We employ the driver.get() function to access the Glassdoor page housing the desired reviews. For this example, we specifically target the Salesforce reviews page.
Iterating through Reviews
Within a well-structured loop, we iterate through a predetermined number of pages, enabling systematic and extensive review extraction. This count can be adjusted based on individual requirements.
Expanding Review Details
We proactively expand the review details during each iteration by interacting with the “Continue Reading” elements, facilitating a comprehensive collection of pertinent information.
We systematically locate and extract many review details, including review headings, job particulars (date, role, location), ratings, employee tenure, pros, and cons. These details are segregated and stored in separate lists, ensuring accurate representation.
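The job line that carries the date, role, and location can be split with plain string operations. The sketch below is a minimal, browser-free illustration; the "date - role in location" layout and the helper name parse_job_line are assumptions based on the splitting logic used in this article, and the sample strings are made up rather than real Glassdoor data.

```python
def parse_job_line(line):
    """Split a 'date - role in location' string into its three parts.

    The layout is an assumption; a missing location is reported as '-'.
    """
    date, _, rest = line.partition(' - ')
    role, sep, location = rest.partition(' in ')
    return {
        'date': date.strip(),
        'role': role.strip(),
        'location': location.strip() if sep else '-',
    }

# Illustrative strings, not real Glassdoor data
print(parse_job_line('Jun 1, 2023 - Software Engineer in Bengaluru'))
print(parse_job_line('May 20, 2023 - Account Executive'))
```

Using partition() instead of indexing into split() results avoids an IndexError when a review has no location.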
Creating a DataFrame
By leveraging the capabilities of Pandas, we establish a temporary DataFrame (df_temp) to house the extracted information from each iteration. This iterative DataFrame is then appended to the primary DataFrame (df), allowing consolidation of the review data.
To manage the pagination process, we efficiently locate the “Next” button and initiate a click event, subsequently navigating to the next page of reviews. This systematic progression continues until all available reviews have been successfully acquired.
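The pagination logic can be reasoned about separately from the browser. The following sketch assumes two hypothetical callables, scrape_page and go_to_next, that stand in for the Selenium calls; it shows one way to stop as soon as there is no further page, so the last page is not read repeatedly.

```python
def collect_pages(scrape_page, go_to_next, max_pages):
    """Collect rows page by page until max_pages or the last page is reached.

    scrape_page() returns the rows on the current page; go_to_next() returns
    False when there is no further page. Both are hypothetical callables.
    """
    rows = []
    for _ in range(max_pages):
        rows.extend(scrape_page())
        if not go_to_next():
            break  # stop instead of re-reading the last page
    return rows

# Stand-in for a three-page site, used only to exercise the loop
pages = [['r1', 'r2'], ['r3'], ['r4', 'r5']]
state = {'page': 0}

def scrape_page():
    return pages[state['page']]

def go_to_next():
    if state['page'] + 1 < len(pages):
        state['page'] += 1
        return True
    return False

all_rows = collect_pages(scrape_page, go_to_next, max_pages=20)
print(all_rows)
```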
Data Cleaning and Sorting
Finally, we proceed with essential data-cleaning operations, such as converting the “Date” column to a datetime format, resetting the index for improved organization, and sorting the DataFrame in descending order based on the review dates.
This meticulous approach ensures the comprehensive and ethical collection of many Glassdoor reviews, enabling further analysis and subsequent text summarization tasks.
# Importing the necessary libraries
import selenium
from selenium import webdriver as wb
import pandas as pd
import time
from time import sleep
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.keys import Keys
import itertools
# Changing the working directory to the path
# where the chromedriver is saved
# (%cd is an IPython/Jupyter magic command)
from selenium.webdriver.chrome.service import Service
%cd "C:\Users\akshi\OneDrive\Desktop"
# Setting up the chrome driver
# Selenium 4 receives the driver path through a Service object
driver = wb.Chrome(service=Service(r"C:\Users\akshi\OneDrive\Desktop\chromedriver.exe"))
# Accessing the Glassdoor page with specific filters
driver.get('https://www.glassdoor.co.in/Reviews/Salesforce-Reviews-E11159.htm?sort.sortType=RD&sort.ascending=false&filter.iso3Language=eng&filter.employmentStatus=PART_TIME&filter.employmentStatus=REGULAR')
df = pd.DataFrame()
num = 20  # number of pages to scrape
for _ in itertools.repeat(None, num):
    # Expand truncated reviews by clicking each "Continue Reading" element
    continue_reading = driver.find_elements(
        By.XPATH,
        "//div[contains(@class,'v2__EIReviewDetailsV2__continueReading "
        "v2__EIReviewDetailsV2__clickable v2__EIReviewDetailsV2__newUiCta mb')]"
    )
    for element in continue_reading:
        try:
            element.click()
        except Exception:
            pass  # element may be hidden or stale; skip it
    time.sleep(5)
    # Review headings
    review_heading = driver.find_elements(
        By.XPATH, "//a[contains(@class,'reviewLink')]")
    review_heading = pd.Series([i.text for i in review_heading])
    # Job line: "date - role in location"
    dets = driver.find_elements(
        By.XPATH,
        "//span[contains(@class,'common__EiReviewDetailsStyle__newUiJobLine')]")
    dets = [i.text for i in dets]
    dates = [i.split(' - ')[0] for i in dets]
    role = [i.split(' - ')[1].split(' in ')[0] for i in dets]
    try:
        loc = [i.split(' - ')[1].split(' in ')[1]
               if i.find(' in ') != -1 else '-' for i in dets]
    except IndexError:
        loc = [i.split(' - ')[2].split(' in ')[1]
               if i.find(' in ') != -1 else '-' for i in dets]
    # Ratings, employee tenure, pros, and cons
    rating = driver.find_elements(
        By.XPATH, "//span[contains(@class,'ratingNumber mr-xsm')]")
    rating = [i.text for i in rating]
    emp = driver.find_elements(
        By.XPATH,
        "//span[contains(@class,'pt-xsm pt-md-0 css-1qxtz39 eg4psks0')]")
    emp = [i.text for i in emp]
    pros = driver.find_elements(By.XPATH, "//span[contains(@data-test,'pros')]")
    pros = [i.text for i in pros]
    cons = driver.find_elements(By.XPATH, "//span[contains(@data-test,'cons')]")
    cons = [i.text for i in cons]
    # Collect this page in a temporary DataFrame and add it to the main one
    df_temp = pd.DataFrame(
        {
            'Date': pd.Series(dates),
            'Role': pd.Series(role),
            'Tenure': pd.Series(emp),
            'Location': pd.Series(loc),
            'Rating': pd.Series(rating),
            'Pros': pd.Series(pros),
            'Cons': pd.Series(cons)
        }
    )
    df = pd.concat([df, df_temp])  # DataFrame.append was removed in pandas 2.0
    # Move to the next page; stop when there is no "Next" button
    try:
        driver.find_element(
            By.XPATH,
            "//button[contains(@class,'nextButton css-1hq9k8 e13qs2071')]"
        ).click()
    except Exception:
        print('No more reviews')
        break
# Convert dates, reset the index, and sort from newest to oldest
df['Date'] = pd.to_datetime(df['Date'])
df = df.reset_index(drop=True)
df = df.sort_values('Date', ascending=False)
df
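The output is a DataFrame with one row per review and the columns Date, Role, Tenure, Location, Rating, Pros, and Cons, sorted from newest to oldest.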
Text Summarization
To generate summaries from the extracted reviews, we use the NLTK library together with Python’s standard string and collections modules for text processing and analysis. The code snippets below walk through the process.
Importing Libraries
We import string, nltk, the NLTK stopwords corpus, and the Counter class from the collections module. Together with pandas, which was imported during data collection, these provide the data manipulation, string processing, and text analysis functionality used in the summarization workflow.
import string
import nltk
from nltk.corpus import stopwords
from collections import Counter
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))
Data Preparation
We filter the obtained reviews based on the desired role (Software Engineer in our scenario), ensuring relevance and context-specific analysis. Null values are removed, and the data is cleaned to facilitate accurate processing.
role = input('Input Role')
df = df.dropna()
df = df[df['Role'].str.contains(role)]
Text Preprocessing
Each review’s pros and cons are processed separately. We normalize case so that matching is case-insensitive and strip punctuation using the translate() function. The text is then split into words, and stopwords and context-specific filler words are removed. The resulting word lists, pro_words and con_words, capture the relevant information for further analysis.
pros = [i for i in df['Pros']]
cons = [i for i in df['Cons']]
# Context-specific filler words to drop in addition to the NLTK stopwords
specific_words = ['great', 'work', 'get', 'good', 'company',
                  'lot', 'it’s', 'much', 'really', 'NAME', 'dont', 'every',
                  'high', 'big', 'many', 'like']
# Split the pros into a list of lowercase words without punctuation
all_words = []
pro_words = " ".join(pros).lower()
pro_words = pro_words.translate(str.maketrans('', '', string.punctuation))
pro_words = pro_words.split()
pro_words = [word for word in pro_words
             if word not in stop_words and word not in specific_words]
all_words += pro_words
# Repeat the same steps for the cons
con_words = " ".join(cons).lower()
con_words = con_words.translate(str.maketrans('', '', string.punctuation))
con_words = con_words.split()
con_words = [word for word in con_words
             if word not in stop_words and word not in specific_words]
all_words += con_words
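The preprocessing steps can be tried in isolation on a couple of made-up sentences. This is a minimal sketch, not the article's exact code: the helper name clean_words and the small STOP_WORDS and FILLER_WORDS sets are stand-ins chosen for illustration, so that the example runs without downloading the NLTK stopwords corpus.

```python
import string

# Tiny stand-in stopword list; the article uses NLTK's English stopwords
STOP_WORDS = {'the', 'is', 'and', 'a', 'to', 'of', 'are'}
# Context-specific filler words, mirroring the idea of specific_words
FILLER_WORDS = {'great', 'good', 'company'}

def clean_words(texts):
    """Lowercase, strip ASCII punctuation, split, and drop unwanted words."""
    joined = " ".join(texts).lower()
    joined = joined.translate(str.maketrans('', '', string.punctuation))
    return [w for w in joined.split()
            if w not in STOP_WORDS and w not in FILLER_WORDS]

print(clean_words(["Great culture, and the benefits are good.",
                   "Flexible hours!"]))
# ['culture', 'benefits', 'flexible', 'hours']
```

Note that string.punctuation covers ASCII punctuation only, so typographic characters such as curly apostrophes survive this step.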
Word Frequency Analysis
Utilizing the Counter class from the collections module, we obtain word frequency counts for both pros and cons. This analysis allows us to identify the most frequently occurring words in the reviews, facilitating subsequent keyword extraction.
# Count the frequency of each word
pro_word_counts = Counter(pro_words)
con_word_counts = Counter(con_words)
To identify key themes and sentiments, we extract the 10 most common words separately from the pros and cons using the most_common() method. When a keyword appears in both lists, we keep it only on the side where it occurs more often, so the same word is not reported as both a pro and a con.
# Get the 10 most common words from the pros and cons
keyword_count = 10
top_pro_keywords = pro_word_counts.most_common(keyword_count)
top_con_keywords = con_word_counts.most_common(keyword_count)
# Check if there are any common keywords between the pros and cons
common_keywords = list(
    set(keyword for keyword, frequency in top_pro_keywords).intersection(
        keyword for keyword, frequency in top_con_keywords))
# Keep each common keyword only on the side where it is more frequent
for common_keyword in common_keywords:
    pro_frequency = pro_word_counts[common_keyword]
    con_frequency = con_word_counts[common_keyword]
    if pro_frequency > con_frequency:
        top_con_keywords = [(keyword, frequency) for keyword, frequency
                            in top_con_keywords if keyword != common_keyword]
    else:
        top_pro_keywords = [(keyword, frequency) for keyword, frequency
                            in top_pro_keywords if keyword != common_keyword]
# Trim both lists to the five keywords that are reported
top_pro_keywords = top_pro_keywords[0:5]
top_con_keywords = top_con_keywords[0:5]
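The idea of assigning a shared keyword to only one side can be illustrated on toy counts. The sketch below is a simplified variant rather than the exact procedure above: it compares frequencies over all words instead of only the top 10, and the function name split_keywords and the sample counts are invented for illustration. Ties go to the cons, mirroring the else branch of the original logic.

```python
from collections import Counter

def split_keywords(pro_counts, con_counts, k=5):
    """Return top-k pro and con keywords, keeping shared words on one side."""
    pros = [w for w, c in pro_counts.most_common()
            if c > con_counts.get(w, 0)]
    cons = [w for w, c in con_counts.most_common()
            if c >= pro_counts.get(w, 0)]
    return pros[:k], cons[:k]

# Made-up counts for illustration
pro_counts = Counter({'culture': 5, 'benefits': 4, 'management': 1})
con_counts = Counter({'management': 3, 'culture': 2, 'hours': 2})

top_pros, top_cons = split_keywords(pro_counts, con_counts)
print(top_pros)  # ['culture', 'benefits']
print(top_cons)  # ['management', 'hours']
```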
Sentiment Analysis
We conduct sentiment analysis on the pros and cons by defining lists of positive and negative words. Iterating over the word counts, we calculate the overall sentiment score, providing insights into the general sentiment expressed in the reviews.
Sentiment Score Calculation
To quantify the sentiment score, we divide the overall sentiment score by the total number of words in the reviews. Multiplying this by 100 yields the sentiment score percentage, offering a holistic view of the sentiment distribution within the data.
# Words treated as positive or negative signals
positive_words = ["amazing", "excellent", "great", "good",
                  "positive", "pleasant", "satisfied", "happy", "pleased",
                  "content", "delighted", "gratified",
                  "joyful", "lucky", "fortunate", "glad", "thrilled",
                  "overjoyed", "ecstatic", "relieved",
                  "impressed", "admirable", "valuing",
                  "encouraging"]
negative_words = ["poor", "slow", "terrible", "horrible",
                  "bad", "awful", "unpleasant", "dissatisfied", "unhappy",
                  "displeased", "miserable", "disappointed", "frustrated",
                  "angry", "upset", "offended", "disgusted", "repulsed",
                  "horrified", "afraid", "terrified", "petrified",
                  "panicked", "alarmed", "shocked", "stunned", "dumbfounded",
                  "baffled", "perplexed", "puzzled"]
# Calculate the overall sentiment score by summing the
# frequencies of positive and negative words
positive_score = 0
negative_score = 0
for word, frequency in pro_word_counts.items():
    if word in positive_words:
        positive_score += frequency
for word, frequency in con_word_counts.items():
    if word in negative_words:
        negative_score += frequency
overall_sentiment_score = positive_score - negative_score
# Calculate the sentiment score in %
total_words = sum(pro_word_counts.values()) + sum(con_word_counts.values())
sentiment_score_percent = (overall_sentiment_score / total_words) * 100
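A small worked example makes the arithmetic concrete. The word lists and counts below are invented for illustration, and the helper name sentiment_percent is hypothetical; the formula is the one described above, (positive - negative) / total words * 100, with a guard for empty input added as a precaution.

```python
from collections import Counter

# Small illustrative lexicons, not the full lists from the article
POSITIVE = {'supportive', 'happy', 'excellent'}
NEGATIVE = {'slow', 'poor', 'frustrated'}

def sentiment_percent(pro_counts, con_counts):
    """Return (positive - negative) / total words, as a percentage."""
    positive = sum(n for w, n in pro_counts.items() if w in POSITIVE)
    negative = sum(n for w, n in con_counts.items() if w in NEGATIVE)
    total = sum(pro_counts.values()) + sum(con_counts.values())
    if total == 0:
        return 0.0  # avoid dividing by zero when there are no words
    return (positive - negative) / total * 100

pro_counts = Counter(['supportive', 'happy', 'culture', 'benefits', 'supportive'])
con_counts = Counter(['slow', 'promotions', 'meetings'])

# 3 positive hits, 1 negative hit, 8 words in total: (3 - 1) / 8 * 100
print(sentiment_percent(pro_counts, con_counts))  # 25.0
```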
Print Results
We present the top 5 keywords for pros and cons, the overall sentiment score, the sentiment score percentage, and the average rating in the reviews. These metrics summarize the prevailing sentiment and employee experience at the organization.
# Print the results
print("Top 5 keywords for pros:", top_pro_keywords)
print("Top 5 keywords for cons:", top_con_keywords)
print("Overall sentiment score:", overall_sentiment_score)
print("Sentiment score percentage:", sentiment_score_percent)
# Ratings are scraped as text, so convert them to numbers before averaging
print('Avg rating given', pd.to_numeric(df['Rating'], errors='coerce').mean())
Sentence Scoring
To capture the most relevant information, we create a bag-of-words model from the pros and cons sentences. We then implement a heuristic scoring function that assigns a score to each sentence based on the occurrence of specific words, which lets us rank the sentences for the summary.
# Join the pros and cons into a single list of sentences
sentences = pros + cons
# Create a bag-of-words model for the sentences
bow = {}
for sentence in sentences:
    # Strip punctuation from this sentence only, then count its words
    words = sentence.translate(str.maketrans('', '', string.punctuation))
    words = words.split()
    for word in words:
        if word not in bow:
            bow[word] = 0
        bow[word] += 1
# Define a heuristic scoring function that assigns
# a score to each sentence based on the presence of
# certain words
def score(sentence):
    words = sentence.split()
    total = 0
    for word in words:
        if word in ["good", "great", "excellent"]:
            total += 2
        elif word in ["poor", "bad", "terrible"]:
            total -= 2
        elif word in ["culture", "benefits", "opportunities"]:
            total += 1
        elif word in ["balance", "progression", "territory"]:
            total -= 1
    return total
# Score the sentences and sort them by score
scored_sentences = [(score(sentence), sentence) for sentence in sentences]
scored_sentences.sort(reverse=True)
We extract the top 10 scored sentences and aggregate them into a cohesive summary using the join() function. This summary encapsulates the most salient points and sentiments expressed in the reviews, providing a concise overview for decision-making purposes.
# Extract the top 10 scored sentences
top_sentences = [sentence for score, sentence in scored_sentences[:10]]
# Join the top scored sentences into a single summary
summary = " ".join(top_sentences)
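The score-then-select pattern can be packaged as a small reusable function. The sketch below is a variant under stated assumptions, not the article's exact scoring: it lowercases each sentence and strips punctuation before matching, uses a reduced WEIGHTS table, and the names score_sentence and summarize as well as the sample reviews are made up for illustration.

```python
import string

# Illustrative weights; a reduced version of the heuristic in the article
WEIGHTS = {'good': 2, 'great': 2, 'excellent': 2,
           'poor': -2, 'bad': -2, 'terrible': -2,
           'culture': 1, 'benefits': 1, 'opportunities': 1}

def score_sentence(sentence):
    """Sum the weights of the known words in a sentence."""
    cleaned = sentence.lower().translate(
        str.maketrans('', '', string.punctuation))
    return sum(WEIGHTS.get(word, 0) for word in cleaned.split())

def summarize(sentences, k=10):
    """Join the k highest-scoring sentences into one summary string."""
    ranked = sorted(sentences, key=score_sentence, reverse=True)
    return " ".join(ranked[:k])

reviews = ["Great culture and good benefits.", "Poor management.", "Good pay."]
print(summarize(reviews, k=2))
# Great culture and good benefits. Good pay.
```

Normalizing case and punctuation means that "Good" and "culture," are matched as well, which a plain split() comparison would miss.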
Print Summary
Finally, we print the generated summary, a valuable resource for individuals seeking insights into the organization’s culture and work environment.
# Print the summary
print("Summary:")
print(summary)
- Good people, good culture, good benefits, good culture, focus on mental health, more or less fully remote.
- Great WLB and ethics cares about employees.
- Colleagues are really great Non toxic and great culture
- Good WLB , good compensation, good culture
- 1. Good pay 2. Interesting work 3. good work life balance 4. great perks – everything urgent is covered
- Great work life balance, good pay great culture, amazing colleagues, great salary
- Very good work culture and benefits
- Great work life balance , great benefits , Supports family values , great career opportunities.
- Collaborative, supportive, strong culture (ohana), opportunities to grow, moving towards async, technically sounds, great mentors and teammates
As we see above, we get a crisp summary and a good understanding of the company culture, perks, and benefits specific to the Software Engineering role. By leveraging the capabilities of NLTK and employing robust text processing techniques, this approach enables the effective extraction of keywords, sentiment analysis, and the generation of informative summaries from the extracted Glassdoor reviews.
Use Cases
The text summarization system being developed holds great potential in various practical scenarios. Its versatile applications can benefit stakeholders, including job seekers, human resource professionals, and recruiters. Here are some noteworthy use cases:
- Job Seekers: Job seekers can significantly benefit from the text summarization system, which provides a concise and informative overview of an organization’s culture and work environment. By condensing Glassdoor reviews, job seekers can quickly gauge the general sentiment, identify recurring themes, and make well-informed decisions about whether an organization aligns with their career aspirations and values.
- Human Resource Professionals: Human resource professionals can leverage the text summarization system to efficiently analyze a substantial volume of Glassdoor reviews. By summarizing the reviews, they can gain valuable insights into the strengths and weaknesses of different organizations. This knowledge can inform employer branding strategies, help identify areas for improvement, and support benchmarking initiatives.
- Recruiters: Recruiters can optimize their time and effort by utilizing the text summarization system to assess an organization’s reputation and work culture. Summarized Glassdoor reviews enable recruiters to swiftly identify key sentiments and important aspects to communicate with candidates. This facilitates a more targeted and effective recruitment process, enhancing candidate engagement and selection outcomes.
- Management and Decision-Makers: The text summarization system offers valuable insights for organizational management and decision-makers. By summarizing internal Glassdoor reviews, they can better understand employee perceptions, satisfaction levels, and potential areas of concern. This information can guide strategic decision-making, inform employee engagement initiatives, and contribute to a positive work environment.
Limitations
Our approach to summarizing Glassdoor reviews involves several limitations and potential challenges that must be considered. These include:
- Data Quality: The accuracy and reliability of the generated summaries heavily rely on the quality of the input data. Ensuring the authenticity and trustworthiness of the Glassdoor reviews used for summarization is essential. Data validation techniques and measures against fake or biased reviews are necessary to mitigate this limitation.
- Subjectivity and Bias: Glassdoor reviews inherently reflect subjective opinions and experiences. The summarization process may inadvertently amplify or diminish certain sentiments, leading to biased summaries. Considering potential biases and developing unbiased summarization techniques are crucial for ensuring fair and accurate representations.
- Contextual Understanding: Understanding the context and nuances of the reviews can be challenging. The summarization algorithm may struggle to grasp specific phrases or expressions’ full meaning and implications, potentially losing important information. Incorporating advanced contextual understanding techniques, such as sentiment analysis and context-aware models, can help address this limitation.
- Generalization: It is important to recognize that the generated summaries provide a general overview rather than an exhaustive analysis of every review. The system may not capture every detail or unique experience mentioned in the reviews, necessitating users to consider a broader range of information before making conclusions or judgments.
- Timeliness: Glassdoor reviews are dynamic and subject to change over time. The summarization system may not provide real-time updates, and the summaries generated may become outdated. Implementing mechanisms for periodic re-summarization or integrating real-time review monitoring can help address this limitation and ensure the relevance of the summaries.
Acknowledging and actively addressing these limitations is crucial to ensure the system’s integrity and usefulness. Regular evaluation, user feedback incorporation, and continuous refinement are essential for improving the summarization system and mitigating potential biases or challenges.
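The contextual-understanding limitation is easy to reproduce with word-level scoring. The following sketch uses a deliberately tiny, made-up lexicon and a hypothetical helper lexicon_score; it shows that a negated phrase receives the same score as its positive counterpart, because each word is scored on its own.

```python
# Tiny illustrative lexicons
POSITIVE = {'good', 'great'}
NEGATIVE = {'bad', 'poor'}

def lexicon_score(text):
    """Add 1 for each positive word and subtract 1 for each negative word."""
    words = text.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

plain = lexicon_score("work life balance is good")
negated = lexicon_score("work life balance is not good")
print(plain, negated)  # 1 1
```

Handling such cases would need context-aware techniques, for example negation rules or a model that looks at neighboring words.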
Conclusion
The project’s objective was to simplify the understanding of a company’s culture and work environment through numerous Glassdoor reviews. We’ve successfully built an efficient text summarization system by implementing a systematic method that includes data collection, preparation, and text summarization. The project has provided valuable insights and key learnings, such as:
- The text summarization system provides job seekers, HR professionals, recruiters, and decision-makers essential insights into a company. Distilling many reviews facilitates more effective decision-making by thoroughly understanding a company’s culture, work environment, and employee sentiments.
- The project has shown the effectiveness of automated methods in gathering and analyzing Glassdoor reviews by using Selenium for web scraping and NLTK for text summarization. Automation conserves time and effort and enables scalable and systematic review analysis.
- The project has underscored the significance of understanding the context in accurately summarizing reviews. Factors such as data quality, subjective biases, and contextual nuances were addressed through data preprocessing, sentiment analysis, and keyword extraction techniques.
- The text summarization system created in this project has real-world applications for job seekers, HR professionals, recruiters, and management teams. It facilitates informed decision-making, supports benchmarking and employer branding efforts, enables efficient evaluation of companies, and provides valuable insights for organizational development.
The lessons learned from the project include the importance of data quality, the challenges of subjective reviews, the significance of context in summarization, and the cyclical nature of system improvement. Using machine learning algorithms and natural language processing techniques, our text summarization system provides an efficient and thorough way to gain insights from Glassdoor reviews.
Frequently Asked Questions
A. Text summarization employing NLP is an approach that harnesses natural language processing algorithms to generate condensed summaries from extensive textual data. It aims to extract crucial details and principal insights from the original text, offering a concise overview.
A. NLP techniques play a pivotal role in text summarization by facilitating the analysis and comprehension of textual information. They empower the system to discern pertinent details, extract key phrases, and synthesize essential elements, culminating in coherent summaries.
A. Text summarization using NLP offers several benefits. It speeds up information assimilation by presenting abridged versions of lengthy documents. Moreover, it supports efficient decision-making by surfacing the key ideas, and it streamlines data handling for improved analysis.
A. Key techniques employed in NLP-based text summarization include natural language comprehension, sentence parsing, semantic analysis, entity recognition, and machine learning algorithms. This combination of techniques enables the system to identify crucial sentences, extract significant phrases, and construct coherent summaries.
A. NLP-based text summarization is highly versatile and adaptable, finding applications across various domains. It effectively summarizes diverse textual sources, such as news articles, research papers, social media content, customer reviews, and legal documents, enabling insights and information extraction in different contexts.
By Analytics Vidhya, June 19, 2023.