There was a costume voting event in a mobile SRPG game called Browndust. The voting was held in the Browndust Official NaverCafe,
and people voted in the format ‘[username] / [unitname]’. There were over 1800 comments, and I thought to myself that
writing a web crawler here wouldn’t be so bad. There was a slight problem when trying to crawl NaverCafe: the URLs are
hidden, so I could not use the orthodox method of reading the HTML of each unique URL. This is why I turned to Selenium.
After crawling with Python and Selenium and building a simple dataframe, I used R and Highcharter to visualize the
result!
Web Crawling
I used Chrome as the WebDriver.
Setting up for Crawling
There isn’t anything particularly special up to this point. The user-defined method is specific to the NaverCafe comment page selector,
so just keep that in mind. If you are interested… the comment pages are addressed with nth-child, and the nth-child index is slightly different depending on how many comment pages there are.
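Roughly, the setup is just standard Selenium boilerplate plus a helper that builds the comment page selector. A minimal sketch (the cafe URL, the selector string, and the helper name are placeholders, not the exact ones from my script):

```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait

driver = webdriver.Chrome()        # Chrome as the WebDriver
wait = WebDriverWait(driver, 10)   # explicit wait with a 10-second timeout

driver.get("https://cafe.naver.com/...")  # placeholder for the voting post URL

def comment_page_selector(page_no):
    """Build the CSS selector for the page_no-th comment page button.

    The pagination buttons are addressed via nth-child, and the index
    shifts slightly depending on how many page buttons are shown.
    """
    return f"div.comment_pagination > a:nth-child({page_no})"
```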
Crawling with Selenium
If you are a programmer with a bit of knowledge of Selenium, you might have some questions about how this code was built…
Here are some explanations of why certain code was inserted in some places. If I see any additional questions in the comments,
I will add them when I have the time! :^)
1. Why do you need time.sleep when you’re already using WebDriverWait?
The XPath for the current comment page and the next comment page are the same, so before the driver moves on to the next page, WebDriverWait matches the XPath on the current page (which returns True immediately and eventually results in an error). So, to avoid matching the XPath on the current page too early, I added time.sleep.
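In code, the pattern looks roughly like this (the XPath is a placeholder; the point is the time.sleep right after the click):

```python
import time
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Placeholder XPath -- on the real page the current and the next comment
# page buttons resolve to the same XPath, which is the whole problem.
NEXT_PAGE_XPATH = "//a[@class='next_page']"

def go_to_next_comment_page(driver, wait):
    driver.find_element(By.XPATH, NEXT_PAGE_XPATH).click()
    # Without this pause, WebDriverWait matches the XPath on the page we
    # are still on, the loop runs ahead of the browser, and the scrape
    # eventually raises an error.
    time.sleep(2)
    wait.until(EC.presence_of_element_located((By.XPATH, NEXT_PAGE_XPATH)))
```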
2. Why do you need send_keys when you only need to click?
If the element to be clicked is not currently visible in the browser viewport, the click returns an error. I googled this problem, and other people have run into it as well. The send_keys call scrolls the browser so the element that’s about to be clicked is actually in view, which solves the error.
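One common way to do this (my actual code may differ slightly) is to send a key press so the page scrolls before the click:

```python
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

def scroll_then_click(driver, xpath):
    # Scroll the page so the target element is physically in view;
    # clicking an element outside the viewport raises an
    # "element not interactable" error.
    driver.find_element(By.TAG_NAME, "body").send_keys(Keys.END)
    driver.find_element(By.XPATH, xpath).click()
```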
Exporting the Dataframe as CSV
The output df looks something like this:
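A minimal sketch, assuming pandas (the example rows and the file name are made up):

```python
import pandas as pd

# Made-up example rows: in the real run each row comes from a scraped
# comment split on the " / " separator of the voting format.
rows = [["user1", "unitA"], ["user2", "unitB"]]
df = pd.DataFrame(rows, columns=["username", "unitname"])

# utf-8-sig keeps the Korean text intact when the CSV is opened in Excel.
df.to_csv("browndust_votes.csv", index=False, encoding="utf-8-sig")
```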
Data Handling and Visualization
Data Handling
Some users didn’t write just the unitname but added extra text such as ‘unit x in maid costume’. Since those
comments did not follow the rules, I separated them into formal / informal dataframes (it’s possible to distinguish them
with grep and apply, so extracting votes from the informal comments later is still possible).
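I did the split in R with grep and apply; a rough pandas equivalent of the same idea (the unit list and the column names are placeholders) would be:

```python
import pandas as pd

# Placeholder list of official unit names used to validate each vote.
unit_names = ["unitA", "unitB", "unitC"]

df = pd.DataFrame({
    "username": ["user1", "user2"],
    "unitname": ["unitA", "unitB in maid costume"],
})

# A vote is "formal" if the unit column is exactly an official unit name;
# everything else (extra text, typos, ...) goes into the informal dataframe.
is_formal = df["unitname"].isin(unit_names)
formal_df = df[is_formal]
informal_df = df[~is_formal]
```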
Data Visualization
Since I uploaded the results to the NaverCafe, they are in Korean, but the Korean text is just unitnames and titles.
Because there was too much data, I removed units with fewer than 10 votes.
Because one unit received overwhelmingly more votes than the rest, there was no need to go through the informal dataframe to extract
additional votes. (Better for me XD)
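The counting and the 10-vote cutoff, in pandas terms for continuity with the sketch above (I actually did this step in R before plotting with Highcharter):

```python
# Continuing from the formal_df sketch above: count votes per unit and
# keep only units with at least 10 votes before plotting.
vote_counts = formal_df["unitname"].value_counts()
plotted_units = vote_counts[vote_counts >= 10]
```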