Pornography Analytics (Part 2) - Clustering and Word Cloud

Raghuvansh Tahlan
12 min read · Jan 31, 2022

Clustering different countries based on the content they consume on Xvideos using TFIDF and K Means and visual representation using Geographical Maps and WordCloud.

Building upon the learnings from Part 1 of the guide, in this part we cluster different countries based on the titles of the videos. The titles were first cleaned and tokenised, then passed through TFIDF to extract features. These features were condensed using Principal Component Analysis (PCA), and clustering was then performed with K Means. Some assumptions are made from time to time to narrow down the study’s scope.

Clustering Countries

The hope is that the content on the website is personalised to cater to the liking of users in each country. This would also mean that the titles of the videos represent the average viewer in that country.

Approach — Text Cleaning, Combining Records, Feature Engineering, Dimensionality Reduction and Clustering.

Text Cleaning

1. Remove links: Sometimes links are added in a video’s title; they add no information to the title or our analysis, so they are removed using the ‘sub’ function from the ‘re’ module.

Code to remove links from the titles/text — Raghuvansh Tahlan

2. Conversion to Lowercase: All text is converted into lowercase (or uppercase) to normalise it. To a reader there is no difference between ‘Text’, ‘text’, ‘TEXT’ or any other combination, but computers treat them differently if the text is not normalised.

Code to convert all text into lowercase — Raghuvansh Tahlan

3. Contractions Expansion: Contractions such as ‘won’t’/‘can’t’ are expanded to ‘will not’/‘cannot’ to normalise the text. The list of contractions is added to Github in pickle format.

Code to remove contractions from the text — Raghuvansh Tahlan

4. Punctuation removal: Punctuation can occasionally carry information, but usually it does not, and it inflates the number of tokens. It is removed using the ‘re’ module.

Code to remove punctuations from the text — Raghuvansh Tahlan

5. Only Alphanumeric Characters: The titles also contain words from regional languages, which would be challenging to pre-process because that requires domain expertise and knowledge of the language. So only English alphanumeric characters are kept for the analysis. One could also try removing numbers. (A combined sketch of all five cleaning steps follows this list.)

Code to remove everything other than Alphanumeric English characters — Raghuvansh Tahlan
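Since the original cleaning gists are embedded as images above, here is a minimal stand-in sketch that chains the five steps into a single helper. The tiny inline contractions dictionary is a hypothetical placeholder for the pickle file on Github, so the snippet runs on its own.

import re

# hypothetical stand-in for the contractions pickle on Github
contractions = {"won't": "will not", "can't": "cannot"}

def clean_title(text):
    text = re.sub(r"http\S+|www\.\S+", "", text)   # 1. remove links
    text = text.lower()                            # 2. lowercase
    for short, full in contractions.items():       # 3. expand contractions
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s]", " ", text)           # 4. drop punctuation
    text = re.sub(r"[^a-z0-9\s]", " ", text)       # 5. keep English alphanumerics only
    return re.sub(r"\s+", " ", text).strip()       # collapse leftover whitespace

print(clean_title("She CAN'T stop!! watch at www.example.com"))
# -> "she cannot stop watch at"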

Combining Records

After cleaning, all records are combined to form one record for each country. This is performed with the ‘groupby’ method and aggregation using the ‘join’ method. Any extra whitespace added while combining the records is also removed.

Code to combine records for each country — Raghuvansh Tahlan
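As a rough sketch of that step (assuming hypothetical column names ‘country’ and ‘title’), the grouping and joining can look like this:

import pandas as pd

df = pd.DataFrame({
    "country": ["India", "India", "France"],
    "title": ["clean title one", "clean title two", "another clean title"],
})

# one combined record per country: join all titles, then squeeze extra whitespace
combined = (
    df.groupby("country")["title"]
      .agg(" ".join)
      .str.replace(r"\s+", " ", regex=True)
      .str.strip()
      .reset_index()
)
print(combined)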

Tokenisation, Feature Engineering and Dimensionality Reduction

When the cleaned records are combined for each country, the resulting dataset contains only 242 records. Each record, a combination of video titles, is split into tokens using the ‘word_tokenize’ function from the ‘nltk’ module. Each record contains, on average, 3000 tokens.

Code to split records into tokens — Raghuvansh Tahlan
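A minimal sketch of the tokenisation step (newer nltk releases also need the ‘punkt_tab’ resource, so both downloads are attempted):

import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)      # tokenizer model used by word_tokenize
nltk.download("punkt_tab", quiet=True)  # required by newer nltk releases

tokens = word_tokenize("she cannot stop watching at home")
print(tokens)  # ['she', 'cannot', 'stop', 'watching', 'at', 'home']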

The TFIDF vectoriser then processes the generated tokens to create sparse features, 12000 in this case. These features are stored in a sparse matrix, which is converted into a NumPy array using the ‘toarray’ method.

Code to use TFIDF on the tokens
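A hedged sketch of the vectorisation, using two toy documents in place of the 242 combined records; TfidfVectorizer can also handle the tokenisation itself when given raw strings:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "she cannot stop watching at home",   # stand-in for one country's combined titles
    "new video hot clip watch now",       # stand-in for another country
]

vectorizer = TfidfVectorizer()
features_sparse = vectorizer.fit_transform(docs)  # scipy sparse matrix
features = features_sparse.toarray()              # dense NumPy array, as in the article
print(features.shape, len(vectorizer.get_feature_names_out()))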

Since the created features are sparse and numerous, I wouldn’t recommend feeding all of them into K Means. It would also be time-consuming since we don’t know the number of clusters to build, and this many features would hinder the search for optimal groups. It is much more beneficial to first pass them through a dimensionality reduction technique such as Principal Component Analysis.

Choosing the optimal number of principal components is an essential task when using PCA. Too few components wouldn’t explain the necessary variance, and too many would burden the system. The ‘explained variance ratio’, an attribute of the PCA object, represents the fraction of the total variance in the data explained by each component. A plot of the cumulative explained variance against the number of components is used to choose an optimal number.

Code to find an optimal number of components in PCA — Raghuvansh Tahlan
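A minimal sketch of the component-selection plot, with random data standing in for the real TFIDF array (242 rows in the article):

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.random((242, 500))  # placeholder for the TFIDF feature matrix

pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

plt.plot(range(1, len(cumulative) + 1), cumulative)
plt.xlabel("Number of components")
plt.ylabel("Cumulative explained variance ratio")
plt.show()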

The plot flattens out once the number of components exceeds 30. There is no fixed rule, but in my opinion, any number in the vicinity of 30 would be good. Using 30 components gives an explained variance ratio of 0.953.

Plot representing explained variance ratio vs number of components

As discussed, the optimal number of clusters is unknown. Still, methods such as the Elbow Curve Method and Silhouette analysis greatly help in finding it. In the Elbow Curve Method, the ‘inertia’, or Within-Cluster Sum of Squared Errors (WSS), of the K Means object is plotted against the number of clusters. Inertia is the sum of squared distances from each data point to its nearest cluster centroid. The number of clusters at which this value stops falling sharply (the elbow) is chosen.

Code to find and plot the inertia of different clusters for k ranging from 1 to 30. — Raghuvansh Tahlan
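A sketch of the elbow computation, again with random data standing in for the PCA-reduced features:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X_pca = rng.random((242, 30))  # placeholder for the 30-component PCA output

ks = list(range(1, 31))
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X_pca).inertia_ for k in ks]

plt.plot(ks, inertias, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Inertia (within-cluster sum of squares)")
plt.show()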

One could find this elbow point visually, but that can be inaccurate given the shape of the curve, and different people may interpret it differently. However, there are several ways to identify the inflection point of an elbow plot computationally; three standard methods are the Silhouette Coefficient, the Calinski-Harabasz score, and knee point detection.

Plot of the SSE/inertia against the number of clusters ranging from 1 to 50 — Raghuvansh Tahlan

Knee Point Detection or Kneedle Algorithm

The Kneedle algorithm (Satopaa et al., 2011) is a generic tool designed to detect “knees” in data. In clustering, the knee represents the point at which further clustering fails to add significantly more detail. In other fields, the knee marks the point of diminishing returns, where spending more money on developing a system or product no longer improves its performance significantly.

An implementation of the Kneedle algorithm is available in the ‘kneed’ package. We can use its ‘KneeLocator’ class to find the optimal number of clusters.

pip install kneed # to install the kneed package
Code to find the optimal number of clusters using Kneedle algorithm from Kneed package — Raghuvansh Tahlan
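A minimal sketch of knee detection, with a toy inertia curve standing in for the values computed in the elbow plot above:

from kneed import KneeLocator

ks = list(range(1, 11))
inertias = [1000, 420, 250, 180, 150, 135, 128, 124, 121, 119]  # toy decreasing curve

knee = KneeLocator(ks, inertias, curve="convex", direction="decreasing")
print("Optimal number of clusters:", knee.knee)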

The Kneedle algorithm gives 14 as the optimal number of clusters. But one could always try other numbers in its vicinity or confirm the result with another method.

Silhouette Coefficient

The silhouette coefficient measures how similar a data point is to its own cluster (cohesion) compared to other clusters (separation). It can be calculated using the ‘silhouette_score’ metric from the sklearn module.

Code to find and plot the silhouette_score of different clusters for k ranging from 2 to 30. — Raghuvansh Tahlan
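A sketch of the silhouette sweep, with random data standing in for the PCA-reduced features:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X_pca = rng.random((242, 30))  # placeholder for the PCA-reduced features

ks = list(range(2, 31))
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X_pca)
    scores.append(silhouette_score(X_pca, labels))

plt.plot(ks, scores, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Silhouette score")
plt.show()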

The ‘silhouette_score’ generally increases and reaches a local maximum at 16 clusters. Then there is a sharp drop, followed by a rise to another local maximum at 20 clusters, and finally the global maximum at 39 clusters. Between 16 and 20 clusters there is not much increase in the score, so 20 can be discarded. Between 16 and 39 clusters there is a large difference of 23 clusters, while the score increases by 14%. Whether this increase is significant depends on the person and the task at hand. Also, in most cases clusters need to be interpretable, which is a lot easier with 16 clusters than with 39.

Plot representing the silhouette_score to the number of clusters ranging from 2 to 50 — Raghuvansh Tahlan

Cluster Visualisation using Geographical Maps

Since we are clustering countries, it is best to view the results on a geographical (world) map. To plot geographical maps in Python, we use the ‘geopandas’ module.

geopandas (Installation): Its installation is trickier than that of most other modules. It is advisable to first try installing it using ‘conda’, the package manager for the Anaconda distribution of Python.

conda config --prepend channels conda-forge
conda install geopandas

If installing with ‘pip’, I would recommend following Geoff Boeing’s excellent guide, which covers Windows. For other operating systems, either try installing with ‘conda’ or search online. I hope everyone finds their way through it.

Another requirement for geopandas is the boundary file needed to draw the map. For this guide, the file used is added to Github (TM_WORLD_BORDERS-0.3.zip); alternatively, one could download it from the Thematic Mapping downloads section.

Importing geopandas: Some users may face issues while importing geopandas due to circular imports. So it’s recommended to first import the package ‘fiona’ and then ‘geopandas’.

import fiona
import geopandas

If everything goes well, you can load the file and plot the world map using the ‘plot’ method.

world = geopandas.read_file("TM_WORLD_BORDERS-0.3.zip")
world.plot()

Plotting Clusters with geopandas

Code to geographically plot the clusters on the world map — Raghuvansh Tahlan

First, the cluster label of each country is added to the dataframe. Then the unique cluster labels are collected into a list. Because the default legend was difficult to plot, matplotlib Patches are used to build the legend instead. An empty plot with a figure size of 15 x 15 is created, the default colour is set to ‘grey’, and the clusters are then plotted in their respective colours.
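A hedged sketch of that plotting routine; the two-row ‘clusters_df’ is a hypothetical placeholder for the real country-to-cluster labels produced by K Means, and country names are assumed to match the ‘NAME’ column of the TM_WORLD_BORDERS shapefile:

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import geopandas

clusters_df = pd.DataFrame({"country": ["India", "France"], "cluster": [0, 1]})  # placeholder labels

world = geopandas.read_file("TM_WORLD_BORDERS-0.3.zip")
world = world.merge(clusters_df, how="left", left_on="NAME", right_on="country")

fig, ax = plt.subplots(figsize=(15, 15))
world.plot(ax=ax, color="grey")  # default colour for countries without a cluster label

patches = []
for i, label in enumerate(sorted(world["cluster"].dropna().unique())):
    colour = plt.cm.tab20(i % 20)
    world[world["cluster"] == label].plot(ax=ax, color=colour)
    patches.append(mpatches.Patch(color=colour, label=f"Cluster {int(label)}"))

ax.legend(handles=patches, loc="lower left")
plt.show()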

Clusters plotted on the world map — Raghuvansh Tahlan.

Depending on where you are from, there could be expected and unexpected observations. If you would like to share some surprising observations, feel free to comment or message me on LinkedIn or Twitter; I would be more than happy to hear them.

Cluster Visualisation using WordCloud

Visualising clusters on geographical maps shows which countries belong to which cluster, but not how the clusters differ. A WordCloud is a handy tool for representing clusters built from text: the more often a word/token occurs in the text, the bigger it appears in the WordCloud.

To create a WordCloud, we use the ‘wordcloud’ library (the ‘word_cloud’ project), which can be installed using ‘pip’ or ‘conda’. We pass the combined text and other parameters to the WordCloud object, which returns an image.

Code to create a WordCloud by combining all text of the cluster — Raghuvansh Tahlan
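A minimal sketch of the WordCloud step; ‘cluster_text’ is a hypothetical stand-in for the combined titles of one cluster:

import matplotlib.pyplot as plt
from wordcloud import WordCloud

cluster_text = "tokens from every title in the cluster joined into one long string"

wc = WordCloud(width=800, height=400, background_color="white").generate(cluster_text)

plt.figure(figsize=(10, 5))
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()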

Here, the Image (left side) represents ‘Cluster 10’, which contains 33 countries, including Australia, Canada, Finland, Germany, Ireland, Israel, Spain, Sweden, Switzerland, the United States, and the United Kingdom.

On the right side, the Image represents ‘Cluster 15’, which contains 18 countries, including China, Hong Kong, the Republic of Korea, Oman, Singapore, Sri Lanka, Taiwan, Tajikistan, and Uzbekistan.

The left Image represents Cluster 10, and the right Image represents Cluster 15 — Raghuvansh Tahlan.

What’s wrong

We found these clusters based on 5 pages from the ‘BEST’ section and 5 pages from the ‘HOME’ section. One could reasonably argue that this study does not consider the page number or the position on the page where a video was found. A video may appear on the 1st page in country ‘X’ and on the 3rd or 5th page in country ‘Y’, yet both cases are treated equally; even on the same page, it could be the first video for one country and the last for another. We also do not know how videos are ranked or positioned on the pages. One could further argue that we should consider only the ‘BEST’ section’s videos, because they summarise the trend for the whole month, whereas the ‘HOME’ section changes very frequently and is affected by global trends. A counter-argument is that the ‘BEST’ section could be more affected by global trends, because only globally popular videos reach the ‘BEST’ section for the month, whereas the ‘HOME’ section captures locally popular content.

What can we do?

Since I could not find similar studies, we can try different experiments and report the observations that occur consistently.

NOTE: All code for the experiments is present in the jupyter notebook uploaded on Github.

Experiment 1: Only considering data from the first page of the ‘HOME’ section, i.e., the first page anyone sees after entering the site.

Since there was less data than before, more components were required in the PCA to achieve a similar explained variance ratio: 36 principal components explain 0.949 of the variance in the original data. According to ‘KneeLocator’ applied to the elbow plot, 13 clusters would have been appropriate. Still, after checking the ‘silhouette_score’ plot, 16 clusters were found optimal, coinciding with the number of clusters when all of the data was used.

Left Image — Cluster 4 (15 Countries) — Australia, Cyprus, Denmark, Finland, Greece, Greenland, Iceland, Latvia, Lithuania, and Norway (some of them).

Right Image — Cluster 10 (27 Countries) — Argentina, Brazil, Canada, Israel, New Zealand, Spain, the United States and the United Kingdom (some of them).

The left Image represents Cluster 4 and the right Image represents Cluster 10 — Raghuvansh Tahlan.

Experiment 2: Considering all data from the ‘HOME’ section/category.

This experiment used more data than Experiment 1, so I expected some changes in the parameters, but surprisingly none were needed; had I made changes, they would have looked forced.

Left Image — Cluster 2 (48 Countries) — Australia, Canada, China, Italy, Japan, Malaysia, New Zealand, Poland, Portugal, Serbia, the United States and the United Kingdom (some of them).

Right Image — Cluster 9 (16 Countries) — Hong Kong, Republic of Korea, Macau, Maldives, Singapore, Sri Lanka, Uzbekistan (some of them).

Experiment 3: Considering all data from the ‘BEST’ section/category.

Now, this was again unexpected. I expected a similar number of principal components and clusters, but the number of principal components decreased from 36 to 25, and the number of clusters increased from 16 to 18. When a dataset is complex and carries a lot of variance/information, PCA finds it hard to reduce the dimensions, so we get more components. Here, the number of components decreased for almost the same explained variance, indicating less variance, i.e., more similar data. This could be because videos only appear in the ‘BEST’ section after garnering views for an entire month, which removes some of the noise from videos that gained viewership for a short duration. The effect is similar to a moving average on time series data.

Left Image — Cluster 5 (22 Countries) — Australia, Austria, Brazil, Canada, Netherlands, New Zealand, Switzerland, the United States, and the United Kingdom (some of them).

Right Image — Cluster 15 (7 Countries) — Indonesia, Japan, Malaysia, Thailand and Vietnam (some of them).

Comparing Clusters

We used two methods for finding the optimal number of clusters: the Elbow Method and the Silhouette Score, and more often than not we gave preference to the Silhouette Score. To compare different clusterings, I think it is fair to compare their Silhouette Scores.

Since ‘Experiment 1’ (only the first page of the ‘HOME’ section) used less data than our original study and the other experiments, it is not considered in the comparison.

Between our initial study and the other two experiments, ‘Experiment 3’ achieved the highest Silhouette Score, which indicates how well the clusters are formed. But upon inspecting the clusters more carefully, I am inclined to believe that the high score arises because the ‘BEST’ section contains videos following the global trend, not the significant local trends we were interested in.

We had assumed that the website was customising content for each country, but in fact it only does so for major regions. When the video titles’ text was aggregated for each country and the ‘drop_duplicates’ function was applied, the number of unique texts dropped significantly. For our original experiment, when all the data was considered, there were only 78 unique texts, far fewer than the 242 countries we were hoping to distinguish. The ‘Experiment 2’ and ‘Experiment 3’ data had 71 and 43 unique combined titles, respectively. Since the number of unique titles in ‘Experiment 3’ was only 43, there was less variance, making it easier for the principal components to reduce the dimensions; thus, the clusters formed had a higher Silhouette Score.

