Comparing YouTube likes to comment sentiment using AWS

Attila Szuts
6 min read · Dec 9, 2020

Introduction

What could you possibly use AWS for? This seems like a ridiculous question, but it really puzzled me when our data engineering class gave us the homework of first finding a problem that AWS can solve, and then solving it. Usually it goes the other way around: you have a problem that you want to solve, and you find the right tools for the best solution. It is very much like having a painting that you want to put on your wall: you look for nails or screws, drive them into the wall with a hammer or screwdriver, and hang your painting there. Except now we had to figure out first what to use our nails for.

Anyhow, lately I’ve been spending an embarrassing amount of time on YouTube, so I thought: hey, what if I could get more insight into how individual videos perform on a channel? And what better way to do that than to investigate the Among Us playlist of one of my favourite content creators, Hafu.

Overview

My plan was to combine the power of the YouTube Data API with AWS Comprehend, using R. I would create requests to download the comments and detailed statistics for each video, send the comments to AWS for sentiment analysis, and then generate the report from R.

Data collection

Google Dev Console

First of all, I needed to create a project in the Google Dev Console and generate credentials for the API. I had done this before, so it wasn’t very difficult, but if you need guidance I recommend this article; it came in handy for me later as well.

Youtube Data API

I hadn’t used this API before, so I started by taking a look at the documentation, and I recommend you do the same if you aren’t familiar with this service. You can explore the API and familiarise yourself with the possible requests you can make. I can also recommend Postman, an app that lets you create requests and stores them so you can revisit them later. My original plan was to make direct requests to the YouTube API using the httr package, but while working on it I found a very handy package called tuber, which comes with pre-written functions that make the requests easier. If you are interested in this package, you can find a concise tutorial at this link.
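For reference, authenticating tuber against the YouTube Data API looks roughly like this; the client ID and secret below are placeholders for the credentials from your own Google Dev Console project:

```r
# install.packages("tuber")
library(tuber)

# Placeholders -- substitute the credentials generated in your own
# Google Dev Console project
yt_oauth(
  app_id     = "YOUR_CLIENT_ID.apps.googleusercontent.com",
  app_secret = "YOUR_CLIENT_SECRET"
)
```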

Prepare workspace for the script

Next, we are going to download data from YouTube using the API.

I am going to download a whole playlist of Among Us gameplays. This does not take too much time, but later steps will, so you can also choose to download data on just one video.
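A sketch of the playlist download with tuber; the playlist ID below is a placeholder, and passing a max_results above 50 tells tuber to page through all results:

```r
# Placeholder ID -- substitute the ID of the playlist you want to analyze
playlist_id <- "PLxxxxxxxxxxxxxxxxxx"

# Returns one row per playlist item, including each video's ID
raw_playlist <- get_playlist_items(
  filter      = c(playlist_id = playlist_id),
  max_results = 51  # above 50, tuber fetches every page of results
)
```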

After downloading the raw data, I extracted the video IDs and created three more requests that returned stats on each video (like views, likes, dislikes), details about each video (like title and creator) and, of course, the comments. The plan is to send the comments to AWS, get the sentiments back, and then combine all three datasets (after aggregating the individual comment sentiment scores). Let’s see how we can use AWS Comprehend to get those scores!
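Roughly, the three follow-up requests can be made like this; this is a sketch assuming the raw_playlist object from above, with column names following tuber’s output:

```r
library(dplyr)
library(purrr)

# Video IDs live in the contentDetails.videoId column of the playlist response
video_ids <- as.character(raw_playlist$contentDetails.videoId)

# One request per video for the statistics (views, likes, dislikes, ...)
video_stats <- map_df(video_ids, ~ as.data.frame(get_stats(video_id = .x)))

# ...and for the content details (title, channel, description, ...)
video_details <- map(video_ids, ~ get_video_details(video_id = .x))

# Finally, the comment threads; max_results above 100 fetches all comments
comments <- map_df(video_ids, function(id) {
  get_comment_threads(filter = c(video_id = id), max_results = 101) %>%
    mutate(video_id = id)
})
```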

AWS Comprehend

First of all, if you have never used AWS before, you need to download your access keys, similar to how you did it before with the Google Dev Console. These also need to be loaded into the environment so you can make calls to AWS’s API. After that, you can send your requests through the aws.comprehend package. One issue to deal with: Comprehend only accepts texts up to 5,000 characters, and some comments are longer than that. For now, I simply discarded the part beyond the limit and used the first part to estimate the sentiment. Requesting the sentiment analysis takes a long time, because we send a request for each comment (hundreds per video) on each video (68 in this playlist).
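A minimal sketch of the Comprehend step, assuming the comments data frame from before; detect_sentiment() returns a sentiment label plus Positive/Negative/Neutral/Mixed confidence scores for each text:

```r
# install.packages("aws.comprehend")
library(aws.comprehend)

# Load the AWS access keys into the environment, analogous to the Google keys
Sys.setenv(
  AWS_ACCESS_KEY_ID     = "YOUR_ACCESS_KEY",
  AWS_SECRET_ACCESS_KEY = "YOUR_SECRET_KEY",
  AWS_DEFAULT_REGION    = "us-east-1"
)

# Comprehend rejects texts over the 5,000-character limit,
# so keep only the first part of each comment
comments$text_short <- substr(comments$textOriginal, 1, 5000)

# One request per comment -- this is the slow part
sentiments <- purrr::map_df(
  comments$text_short,
  ~ detect_sentiment(.x, language = "en")
)
```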

Data Analysis

Cleaning the data

Now that we have sentiments for each comment, we just need to wrangle our data so we can easily extract the information we are after. I have pasted a short code snippet below if you want to take a look at it. As you can see, I used weighted averages of the sentiment confidence probabilities, with the number of likes each comment got as the weights. The reasoning is that comments with more likes are arguably more important, because more people agree with them.
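A sketch of that aggregation, assuming the comments and sentiments objects from the earlier steps (the +1 on the weights is a guard of mine so that comments with zero likes still count):

```r
library(dplyr)

# Attach the confidence scores to the comments, then aggregate to video level,
# weighting each comment's scores by its number of likes (+1)
video_sentiment <- comments %>%
  bind_cols(sentiments %>% select(Positive, Negative, Neutral, Mixed)) %>%
  mutate(
    likeCount = as.numeric(as.character(likeCount)),
    weight    = likeCount + 1
  ) %>%
  group_by(video_id) %>%
  summarise(
    positive = weighted.mean(Positive, weight),
    negative = weighted.mean(Negative, weight),
    neutral  = weighted.mean(Neutral, weight),
    mixed    = weighted.mean(Mixed, weight)
  )
```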

Analysis

And finally, we have reached the exciting part of our little research: is there any meaningful relationship between the number of likes/dislikes and positive/negative comment sentiment? Before jumping right to that, I wanted to see what our data looks like, so I created a few plots visualising the sentiments and different metrics.

On the first plot below, I colored each video with the color of the sentiment that had the highest number of comments. So if a video got mostly positive comments, it is colored green; if mostly negative, red; and so on. It is very interesting that most videos are neutral, even very popular ones. I will get back to this in the conclusion.
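A sketch of how such a plot can be put together, assuming the objects from the earlier steps; slice_max() keeps the most frequent sentiment label per video:

```r
library(dplyr)
library(ggplot2)

# For each video, find the sentiment label with the highest number of comments
dominant <- comments %>%
  bind_cols(sentiments["Sentiment"]) %>%
  count(video_id, Sentiment) %>%
  group_by(video_id) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

# Color each video's view count by its dominant comment sentiment
dominant %>%
  left_join(
    video_stats %>% transmute(video_id = id, views = as.numeric(viewCount)),
    by = "video_id"
  ) %>%
  ggplot(aes(x = reorder(video_id, views), y = views, fill = Sentiment)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Views", fill = "Dominant sentiment")
```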

I also wanted to see the like-dislike ratio, and as you can see, most of the videos have very few dislikes, while the number of likes varies a lot.

And finally, let’s see if there is any relationship between likes and views. There is some, but it is still very hard to explain the number of likes with the sentiment of the comments. Also, it is interesting to see that the number of likes does not increase significantly even when the number of views does.

And my correlation results show the same thing: there is practically zero correlation between our variables of interest.
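For reference, the correlation check can be done along these lines (variable names assume the sketches above):

```r
# Join the per-video sentiment scores with the like/dislike counts
analysis_df <- video_sentiment %>%
  left_join(
    video_stats %>% transmute(video_id = id,
                              likes    = as.numeric(likeCount),
                              dislikes = as.numeric(dislikeCount)),
    by = "video_id"
  )

# Both come out close to zero in our case
cor(analysis_df$likes,    analysis_df$positive, use = "complete.obs")
cor(analysis_df$dislikes, analysis_df$negative, use = "complete.obs")
```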

Conclusion

So it seems like there is no relationship at all, which is very unfortunate. However, it is worth mentioning a few takeaways here and taking them into consideration when we evaluate the project as a whole.

First of all, our aggregation methods were crude, to say the least. For example, when we had to create shorter requests for AWS Comprehend, we simply ignored everything after the first 5,000 characters. We could create more advanced requests that return a sentiment for the whole comment; for this we would either need an AWS account with more privileges, or we could aggregate the chunk-level sentiment data ourselves, as sketched below.
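A sketch of that chunking idea: split a long comment into pieces under the limit, request a sentiment for each piece, and average the confidence scores (the helper below is hypothetical):

```r
# Hypothetical helper: average Comprehend scores over <=5,000-character chunks
sentiment_long <- function(text, chunk_size = 5000) {
  starts <- seq(1, nchar(text), by = chunk_size)
  chunks <- substring(text, starts, pmin(starts + chunk_size - 1, nchar(text)))
  scores <- purrr::map_df(
    chunks,
    ~ aws.comprehend::detect_sentiment(.x, language = "en")
  )
  colMeans(scores[, c("Positive", "Negative", "Neutral", "Mixed")])
}
```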

Also, it is very likely that sarcastic comments meant to be funny, or inside jokes, are not picked up by the engine, which can also distort the results. In hindsight, such videos were not the best target for sentiment analysis.

So we cannot say decisively that there is no relationship at all; perhaps our methods were just not refined enough to detect it. We might improve our models if we continued data collection, perhaps using a wider sample. For this it would be useful to store the existing data and only fetch updates on new comments and video statistics. That would also significantly improve response times!

All in all, this was a fun and insightful project, and a good start for writing more blog posts here!
