Pornography Analytics (Part 1) — Web Scraping and Descriptive Analytics

Raghuvansh Tahlan
10 min read · Jan 31, 2022

Create Dataset from Xvideos via Web Scraping and perform Exploratory Data Analysis (EDA).

A Google search tells us that pornography is a $97 billion industry, most of whose revenue is generated from digital porn. Some popular digital pornography sites include Xvideos, Pornhub, xnxx and Xhamster, as reported in the research paper titled “The impact of COVID-19 pandemic on pornography habits: a global analysis of Google Trends”, published in the International Journal of Impotence Research. The paper also documented a worldwide increase in pornography consumption since Covid-19, with social isolation, psychological distress and loneliness the most commonly cited reasons. Although pornography in general, and these sites in particular, are popular behind closed doors, they have rarely featured in software engineering or data science projects (NudeNet is one project worth reading and exploring). This article/guide attempts to bring them into mainstream projects and discussion.

This will be a two-part article/guide. Part 1 covers web scraping data from Xvideos into a structured format and descriptive analytics (Exploratory Data Analysis) on that data.

Part 2 will use the dataset created in Part 1 to cluster countries based on the content they consume using K-means (with the optimal number of clusters found via knee-point detection and the Silhouette score). The clusters will then be visualised using geographical maps and word clouds.

Web Scraping from Xvideos

Website and Data

Xvideos provides personalised content based on the user’s country. When a user visits the website, the country corresponding to their IP address is selected automatically and shown in the top left corner of the webpage, as in the attached screenshot (left). Users can change this country from a list of around 240 countries, some of which are shown in the screenshot (right). Changing the country on the website changes the videos on the ‘HOMEPAGE’ and in the ‘BEST’ section. This change was manually verified for a few countries, and the rest of this analysis assumes that the same per-country personalisation holds for every country on the list.

Screenshot from Xvideos.com

For each country on the list, we will scrape data from 10 web pages: 5 pages from the homepage and 5 pages from the ‘BEST’ section. For each video on any of these pages, the following attributes or features will be stored:

TITLE — The title of the video.

QUALITY — The Quality of the video (360p, 720p, 1080p, 4K).

UPLOADER — Id of the uploader who uploaded the respective video.

VIEWS — The number of times the video was played.

CATEGORY — HOME (from the home page) or BEST (from the Best section).

COUNTRY — Country for which the video was scraped.

PAGE_NUMBER — If the CATEGORY of the record is ‘HOME’, this is the page number; if the CATEGORY is ‘BEST’, it represents the month.

Description of ‘CATEGORY’ and ‘PAGE_NUMBER’

The meaning of ‘CATEGORY’ and ‘PAGE_NUMBER’ can be confusing on their own but makes more sense when the two are read together, e.g.

When the country is changed, the first visible page is termed the homepage, and its page number is 1. CATEGORY = ‘HOME’ with ‘PAGE_NUMBER’ = 3 therefore means ‘page 3 of the homepage’. Similarly, CATEGORY = ‘BEST’ with ‘PAGE_NUMBER’ = 3 means ‘the first page of October’. I hope it makes a little more sense now.

Every website tries to increase user engagement by personalising its products, videos in this case. This personalisation can happen at the user level or at a more general level, such as the country level here. Since the website tries to maximise views through personalisation, its content can also act as a representative of the people consuming it, which serves as the basis for our research. The rationale for the ‘HOME’ category is that the website places its most relevant content on the landing page to hook the user, and the most relevant content should describe the majority of the population. The ‘BEST’ category is chosen because the homepage keeps changing with current trends, whereas the best videos by month remain the same. For the ‘BEST’ category, the data for the current month is not scraped; data for the previous 5 months is chosen instead.

Pornhub, another website, was also considered for the dataset, and it may also personalise content by country. However, there was no option to view this personalised content for every country by changing some parameter or visiting a different link.

Web Scraping

This project uses Selenium 4.1.0 for web scraping; some changes may be required when using a different version of Selenium.

BASE URL

Upon inspecting the URLs for some of the countries, we can generalise the pattern: a common URL plus a changing variable. Here ‘af’ and ‘ar’ are the variables, and the common part is saved as ‘BASE_URL’.

Afghanistan — https://www.xvideos.com/change-country/af

Argentina — https://www.xvideos.com/change-country/ar

BASE_URL = 'https://www.xvideos.com/change-country/'

Since the ‘&lt;ul&gt;’ tag contains all the country names and URLs, we can parse the changing variable for each country. A dictionary mapping country names to their changing pattern was created and stored as ‘COUNTRIES_DICT’.

<ul> tag containing all the nations and their changing pattern.
COUNTRIES_DICT = {'Afghanistan': 'af', 'Chad': 'td', 'Greece': 'gr', 'Libya': 'ly', 'Pakistan': 'pk', 'Sudan': 'sd',...}
## sample of the countries dict
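The article does not reproduce the parsing code, but the dictionary could be built from the ‘&lt;ul&gt;’ markup along these lines. This is a minimal sketch: the HTML snippet below is a hypothetical stand-in for the real list, which has around 240 entries with the same href pattern.

```python
import re

# Hypothetical stand-in for the <ul> markup on the country-selection page.
html = '''
<ul>
  <li><a href="/change-country/af">Afghanistan</a></li>
  <li><a href="/change-country/ar">Argentina</a></li>
  <li><a href="/change-country/gb">United Kingdom</a></li>
</ul>
'''

# Pull out (code, name) pairs and invert them into name -> code.
pairs = re.findall(r'href="/change-country/([a-z]{2})">([^<]+)</a>', html)
COUNTRIES_DICT = {name: code for code, name in pairs}
print(COUNTRIES_DICT)
# {'Afghanistan': 'af', 'Argentina': 'ar', 'United Kingdom': 'gb'}
```

On the real page, the same pattern applied to the full ‘&lt;ul&gt;’ yields the complete dictionary.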

SCRAPE ONE PAGE -> SCRAPE COUNTRY -> SCRAPE ALL COUNTRIES

Now that we have the common URL and the changing pattern for each country, we first write code to scrape one country and then modify it to loop through all countries. Since all the pages are identical, we only need code to parse a single page, which we can then reuse for every page of a country.

SOURCE CODE OF A WEBPAGE

To get the source code of the webpage, we need to create an instance of Selenium (driver) and navigate to the required page.

Lines 1–4 are the imports required for Selenium. Line 5 creates a service instance for Selenium, which requires the path of the webdriver for your browser (Chrome/Firefox/Edge). Please follow the Selenium documentation to install the required driver, and keep in mind that drivers are built for a specific browser and specific builds of that browser. In line 7, we specify options for an incognito window so that no data is stored and scraping one country does not affect the others. Line 8 creates Selenium’s driver, which does all the work; the service instance and the options created earlier are passed to it. Line 9 opens a browser window using the driver’s ‘get’ function. The URL passed to ‘get’ combines BASE_URL and the variable extension for each country.
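The original gist is not reproduced here, but the setup just described could be sketched as follows (Selenium 4 API). The chromedriver path is a machine-specific placeholder, and the imports are kept inside the function so the snippet loads even without Selenium installed.

```python
BASE_URL = 'https://www.xvideos.com/change-country/'

def open_country_page(country_code, driver_path='/path/to/chromedriver'):
    """Open an incognito Chrome window on the page for one country.

    A sketch of the setup described above; replace driver_path with the
    location of the chromedriver binary on your machine.
    """
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.chrome.options import Options

    service = Service(driver_path)          # path to the browser's webdriver
    options = Options()
    options.add_argument('--incognito')     # fresh session for every country
    driver = webdriver.Chrome(service=service, options=options)
    driver.get(BASE_URL + country_code)     # e.g. BASE_URL + 'gb'
    return driver

print(BASE_URL + 'gb')
# https://www.xvideos.com/change-country/gb
```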

BASE_URL + COUNTRIES_DICT.get('United Kingdom')
## 'https://www.xvideos.com/change-country/' + 'gb' = 'https://www.xvideos.com/change-country/gb'

If everything goes well, a new window will open and look like the image below. ‘United Kingdom’ will also be visible in the top left corner, as in the screenshot, because we added the variable extension for the United Kingdom.

Click on the disclaimer and enter the site.

Making the first click using Selenium is the only part you need to figure out; after that it’s all the same: change some parameters and repeat. (Follow the arrows in the screenshot.)

Open the ‘Developer tools’ tab (Ctrl + Shift + I for chrome).

  1. Click the element marked 1. It will highlight any element you click or hover over on the webpage.
  2. Click the disclaimer section “I am 18 years …”, marked 2 on the screenshot.
  3. The code for the disclaimer section will be highlighted (&lt;button&gt;…&lt;/button&gt;), marked 3 on the screenshot.
  4. Right-click the highlighted code section and choose ‘Copy’ (marked 4), then select ‘Copy XPath’ (marked 5).

This way, you can get ‘XPath’ for any element you want to click on the website. If you have the XPath, use the code below to make a click.
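The click itself is a one-liner with Selenium 4’s find_element API. The XPath value below is hypothetical: the real one is whatever ‘Copy XPath’ gives you, and it changes whenever the site’s markup changes.

```python
def click_xpath(driver, xpath):
    """Locate an element by XPath and click it (Selenium 4 find_element API)."""
    from selenium.webdriver.common.by import By  # imported here so the sketch
    driver.find_element(By.XPATH, xpath).click() # loads without Selenium

# Hypothetical XPath for the disclaimer button; use your own copied value.
DISCLAIMER_XPATH = '/html/body/div[1]/div/div[2]/button[1]'
# click_xpath(driver, DISCLAIMER_XPATH)
```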

Combining all the code from above, we can now open the website in a new incognito window and click through the disclaimer. We are now on the site and need to scrape the details of all videos on the page.

The function ‘get_data’ takes the source code of a webpage and returns a list in which each element is a dictionary containing the details of a single video. Its arguments are the source code of the page as ‘page_source’, the list to which new dictionaries are added as ‘DATA’, ‘HOME’/‘BEST’ or any other category as ‘category’, the name of the country being scraped as ‘country’, and the page number as ‘page_num’. Each dictionary comprises the keys TITLE, QUALITY, UPLOADER, VIEWS, CATEGORY, COUNTRY and PAGE_NUMBER, as discussed earlier.
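The gist for ‘get_data’ is not shown above, but a sketch with that signature might look like this. The class names in the regex are hypothetical stand-ins: the real parser targets the site’s actual markup.

```python
import re

def get_data(page_source, DATA, category, country, page_num):
    """Append one dict per video found in page_source to DATA and return it.

    Sketch only: the class names below are hypothetical; the real site's
    markup would need its own selectors.
    """
    video = re.compile(
        r'<div class="video">'
        r'<p class="title">(?P<TITLE>[^<]+)</p>'
        r'<span class="quality">(?P<QUALITY>[^<]+)</span>'
        r'<span class="uploader">(?P<UPLOADER>[^<]+)</span>'
        r'<span class="views">(?P<VIEWS>[^<]+)</span>'
    )
    for match in video.finditer(page_source):
        record = match.groupdict()
        record.update(CATEGORY=category, COUNTRY=country, PAGE_NUMBER=page_num)
        DATA.append(record)
    return DATA

# Demo on a minimal stand-in for a page source:
sample = ('<div class="video"><p class="title">Video A</p>'
          '<span class="quality">1080p</span>'
          '<span class="uploader">uploader1</span>'
          '<span class="views">1.2M</span>')
DATA = get_data(sample, [], 'HOME', 'United Kingdom', 1)
print(DATA[0]['TITLE'], DATA[0]['CATEGORY'])  # Video A HOME
```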

Now we have code to open an incognito window, click the disclaimer, download the source code of the page and scrape every video on it. The next step is to navigate from this page to the other pages and scrape their data. The code below does exactly that; let’s look briefly at how.

In line 1, we create an empty list that will be passed to the function ‘get_data’ and will hold all the scraped data. Line 2 creates a variable ‘country_name’, which stores the country we are scraping. Lines 3–9 create an incognito window, fetch the webpage and click the disclaimer. Line 11 uses an implicit wait until the elements of the webpage are loaded. Line 12 calls ‘get_data’ to scrape all the video details from the page. Lines 14–17 click the link for page number 2, download its source code and scrape the page. Lines 19–32 perform the same process for pages 3, 4 and 5. Lines 34–39 click through to the ‘BEST’ section and scrape its page. The rest of the code scrapes page number 1 for November, October, September and August. Line 61 closes the driver and the window.
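A condensed sketch of that flow is below. The locators (link texts) are hypothetical and would come from inspection on the real site; ‘get_data’ is passed in as the page parser with the signature described earlier.

```python
def scrape_country(driver, country_name, DATA, get_data):
    """Scrape 5 'HOME' pages, then the 'BEST' page for the recent months.

    Sketch only: By.LINK_TEXT locators below are assumptions, not the
    site's real navigation elements.
    """
    from selenium.webdriver.common.by import By

    driver.implicitly_wait(10)                       # let elements load
    get_data(driver.page_source, DATA, 'HOME', country_name, 1)

    for page in range(2, 6):                         # HOME pages 2..5
        driver.find_element(By.LINK_TEXT, str(page)).click()
        get_data(driver.page_source, DATA, 'HOME', country_name, page)

    driver.find_element(By.LINK_TEXT, 'BEST').click()  # hypothetical locator
    get_data(driver.page_source, DATA, 'BEST', country_name, 1)

    # Page 1 of each previous month (also hypothetical locators).
    for page, month in enumerate(['November', 'October', 'September', 'August'], 2):
        driver.find_element(By.LINK_TEXT, month).click()
        get_data(driver.page_source, DATA, 'BEST', country_name, page)

    driver.quit()                                    # close window and driver
```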

PARSE VIDEOS FOR ALL COUNTRIES

Since we have the code to scrape data for one country, we loop over all countries with a for-loop. Additionally, exception handling is added, along with code to save the data into a CSV by converting the list into a dataframe.
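The loop-with-exception-handling pattern could be sketched as below. The per-country scraper is passed in as a callable (the demo uses a dummy lambda just to show the output shape); wrapping each country in try/except means one bad country does not abort the whole run.

```python
import pandas as pd

def scrape_all(countries, scrape_one):
    """Scrape every country, skipping any that raise, and return a DataFrame.

    `scrape_one` is a per-country scraper returning a list of record dicts
    (hypothetical here, standing in for the real Selenium routine).
    """
    data = []
    for country in countries:
        try:
            data.extend(scrape_one(country))
        except Exception as exc:
            print(f'Skipping {country}: {exc}')     # log and move on
    return pd.DataFrame(data)

# Demo with a dummy scraper:
df = scrape_all(['Afghanistan', 'Argentina'],
                lambda country: [{'TITLE': 'demo', 'COUNTRY': country}])
print(len(df))                              # 2
# df.to_csv('dataset.csv', index=False)     # save the real run to CSV
```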

Exploratory Data Analysis

Before diving into the more significant problem, it is good to get to know the dataset; that is what this section is about. First, we check whether any features need pre-processing, then we compute basic statistics and answer questions such as: which videos are most popular, which uploader has the most videos, which uploader has the highest average views, etc.

  1. Read the dataset.
Reading the dataset and viewing head to check the data.

2. Although easily understood by humans, the ‘VIEWS’ feature represents a numeric quantity stored as a string, so we should convert it into a numeric type the computer can work with.

Code to convert views in the string format to Numerical format — Raghuvansh Tahlan

First, the number is extracted from the string and converted to float, then multiplied by the respective multiplier (1,000 for K and 1,000,000 for M). The map function applies this conversion to every record in the dataset. Since the resulting numbers are often in the millions, each value is also divided by 1,000 to improve readability.
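The conversion gist is not reproduced above; a minimal sketch consistent with that description:

```python
import pandas as pd

def views_to_number(views):
    """Convert a display string like '1.2M' or '341K' to a float.

    The result is divided by 1,000, so values are reported in thousands
    of views, which keeps later tables readable.
    """
    multipliers = {'K': 1_000, 'M': 1_000_000}
    views = views.strip()
    if views[-1] in multipliers:
        number = float(views[:-1]) * multipliers[views[-1]]
    else:
        number = float(views)
    return number / 1_000            # report in thousands

df = pd.DataFrame({'VIEWS': ['1.2M', '341K', '999']})
df['VIEWS'] = df['VIEWS'].map(views_to_number)
print(df['VIEWS'].tolist())          # [1200.0, 341.0, 0.999]
```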

3. Number of unique videos: Although there are 70K records in the dataset, some videos appear in one or both of the categories ‘HOME’ and ‘BEST’, and a video may also appear in the records of multiple countries. E.g., the video in the screenshot below was found for 186 countries.

Screenshot of the video which was found in 186 countries

After dropping the duplicate video titles using the subset argument in the drop_duplicates function, we get 8447 unique video titles.

df.drop_duplicates(subset='TITLE')  ## Method 1
df['TITLE'].drop_duplicates()       ## Method 2

4. Missing attributes: Out of these 8447 unique video titles, 267 are missing the ‘QUALITY’ attribute.

df.drop_duplicates(subset='TITLE').isna().sum()

5. Number of videos of each quality: Around 50% are in 1080p, with only one 4K video.
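This count comes from value_counts over the de-duplicated titles. A sketch on a small demo frame standing in for the scraped dataset:

```python
import pandas as pd

# Demo frame; the real dataset has ~8447 unique titles.
df = pd.DataFrame({
    'TITLE':   ['a', 'a', 'b', 'c', 'd'],
    'QUALITY': ['1080p', '1080p', '720p', '1080p', '4K'],
})
# Count qualities over unique titles only, so duplicates don't inflate counts.
quality_counts = df.drop_duplicates(subset='TITLE')['QUALITY'].value_counts()
print(quality_counts.to_dict())  # {'1080p': 2, '720p': 1, '4K': 1}
```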

6. Top 10 uploaders: There are a total of 4157 individual uploaders, with ‘Model Media’ uploading the most videos, 55.

len(df.drop_duplicates(subset='TITLE')['UPLOADER'].unique())
Visual Representation of the Number of Video uploaded by Top 10 uploaders
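The top-10 ranking itself is a value_counts over unique titles. A sketch with demo data (the real frame has 4157 uploaders):

```python
import pandas as pd

# Demo frame standing in for the scraped dataset.
df = pd.DataFrame({
    'TITLE':    ['a', 'b', 'c', 'd', 'd'],
    'UPLOADER': ['Model Media', 'Model Media', 'Other', 'Third', 'Third'],
})
uniq = df.drop_duplicates(subset='TITLE')
top_uploaders = uniq['UPLOADER'].value_counts().head(10)
print(top_uploaders.idxmax())  # Model Media
```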

7. Top 5 most-watched videos: The most-watched video was uploaded by ‘Bangbros Network’ and was viewed over 93 million times.

(df.loc[:, ['TITLE', 'VIEWS', 'UPLOADER']]
   .drop_duplicates(subset='TITLE')
   .sort_values('VIEWS', ascending=False)
   .head(5))

8. Popular uploaders: I believe individuals can be considered uploaders only if they upload videos regularly, so uploaders with more than 4 videos were filtered to find the popular ones. These uploaders were then ranked by mean views, calculated by dividing an uploader’s total views by the number of videos uploaded. ‘Bangbros Network’ was the most popular uploader, averaging over 26 million views per uploaded video.
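The filter-then-rank step could be sketched as below, on demo data rather than the real dataset (VIEWS here are already numeric, as prepared earlier):

```python
import pandas as pd

# Demo frame of unique titles.
uniq = pd.DataFrame({
    'TITLE':    [f't{i}' for i in range(8)],
    'UPLOADER': ['A'] * 5 + ['B'] * 3,
    'VIEWS':    [10, 20, 30, 40, 50, 900, 910, 920],
})
# Per-uploader video count and mean views.
stats = uniq.groupby('UPLOADER')['VIEWS'].agg(['count', 'mean'])
# Keep only uploaders with more than 4 videos, ranked by mean views.
popular = stats[stats['count'] > 4].sort_values('mean', ascending=False)
print(popular)
```

Uploader ‘B’ has the higher mean but only 3 videos, so the count filter drops it and only ‘A’ qualifies as a popular uploader.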


Raghuvansh Tahlan

Passionate about Data Science. Stock Market and Sports Analytics is what keeps me going. Writer at Analytics Vidhya Publication. https://github.com/rvt123