Detecting the most popular tourist attractions in Valencia using unsupervised learning techniques

DBSCAN algorithm for discovering the most frequently photographed locations in Valencia (Spain).

Published in

Towards Data Science

9 min read · Nov 14, 2021

Valencia is one of the most cosmopolitan and vibrant cities in Spain. Located on the Mediterranean coast, Valencia is the third-largest city and metropolitan region in Spain. The city has been home to many different cultures over its more than 2000 years of history. Romans, Visigoths, and Muslims occupied Valencia, leaving behind a rich collection of art and a distinctive architectural heritage. At present, Valencia is a popular spot for tourists, receiving roughly 2 million visitors every year.

In this article, we will analyze the most relevant spots in Valencia using the geolocation data of photographs available on Flickr. To do so, we will use DBSCAN, an unsupervised learning algorithm that groups data into clusters based on density.

GitHub

The code for this project is available as a Jupyter Notebook on GitHub.

GitHub - amandaiglesiasmoreno/dbscan_photos_valencia

Flickr API

Flickr is one of the most popular photo-sharing sites. To use the Flickr API, you need both a Flickr API key and a Flickr user ID. Once you have a user ID and an API key, you can search for images using the flickrapi library.

The first part of the project consists of importing all the libraries needed along with the creation of a FlickrAPI object.

Obtaining the photos using the `flickr.photos.search` function

The next step consists of obtaining the photos of interest using the Flickr API. The flickr.photos.search function returns a list of photos from Flickr's repository that match the given parameters. To simplify the analysis, we will only download the photos of Valencia from 2019. We specify the year of interest with the arguments (1) min_upload_date and (2) max_upload_date, which filter on the date a photo was uploaded. Additionally, we download only the public photos available on the platform (media='photos' and privacy_filter=1), excluding videos and private photos. Finally, we provide the bounding box of Valencia to guarantee that only photos of this specific region are retrieved.
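The search described above could be sketched roughly as follows. This is not the notebook's code: the helper names build_search_params and search_page are hypothetical, the flickr argument is assumed to be a flickrapi.FlickrAPI client created beforehand, and extras='geo,url_n' and per_page=250 are assumptions chosen to match the attributes and page count mentioned later in the article.

```python
from datetime import datetime, timezone


def build_search_params(bbox, year=2019, page=1, per_page=250):
    """Build keyword arguments for flickr.photos.search (no network call)."""
    start = datetime(year, 1, 1, tzinfo=timezone.utc)
    end = datetime(year, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
    return {
        "bbox": bbox,                               # "min_lon,min_lat,max_lon,max_lat"
        "min_upload_date": int(start.timestamp()),  # Unix timestamp, start of the year
        "max_upload_date": int(end.timestamp()),    # Unix timestamp, end of the year
        "media": "photos",                          # exclude videos
        "privacy_filter": 1,                        # public photos only
        "extras": "geo,url_n",                      # assumed: coordinates and a small image URL
        "per_page": per_page,                       # assumed page size
        "page": page,
    }


def search_page(flickr, bbox, page=1):
    """Request one page of results from an already-created client (assumed)."""
    return flickr.photos.search(**build_search_params(bbox, page=page))
```

Keeping the parameter construction separate from the API call makes it possible to inspect the query without touching the network, and to loop over page numbers when the results span several pages.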

Defining the Bounding Box of the area

We need to provide the bounding box of Valencia to the flickr.photos.search function (parameter bbox) to retrieve only the photographs taken in this region. To obtain the coordinates of the area, we can use the OpenStreetMap database. After manually drawing the region of interest on the map, the bounding box coordinates (bottom-left longitude, bottom-left latitude, top-right longitude, top-right latitude) will appear in the text box located in the upper left corner, as you can see below. Then, we need to provide those coordinates to the bbox parameter as a string, where the values are separated by commas.
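As a small illustration of the expected format, the following sketch turns four coordinates into the comma-separated string. The helper bbox_to_string is hypothetical, and the coordinates are illustrative values roughly around Valencia, not the box used in the article.

```python
def bbox_to_string(min_lon, min_lat, max_lon, max_lat):
    """Format a bounding box as 'min_lon,min_lat,max_lon,max_lat'."""
    if not (-180 <= min_lon < max_lon <= 180):
        raise ValueError("longitudes must satisfy -180 <= min_lon < max_lon <= 180")
    if not (-90 <= min_lat < max_lat <= 90):
        raise ValueError("latitudes must satisfy -90 <= min_lat < max_lat <= 90")
    return ",".join(str(value) for value in (min_lon, min_lat, max_lon, max_lat))


# Illustrative coordinates only (assumption), roughly around Valencia.
valencia_bbox = bbox_to_string(-0.45, 39.40, -0.30, 39.52)
print(valencia_bbox)
```

The order matters: longitude comes before latitude, and the bottom-left corner comes before the top-right one.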

Obtaining the bounding box of Valencia with OpenStreetMap

Converting the output of the search function (XML element) into a Pandas DataFrame

The flickr.photos.search function returns a parsed XML element. This element contains multiple details about the photographs that match the search criteria. The stat attribute of the rsp tag indicates whether the call was executed successfully. The photos tag provides a summary of the search results. In this case, 17997 pictures match the search criteria, organized in 72 pages. Within the photos tag, there is a list of photo tags, each of them containing the information of a particular photo. Only four attributes are of interest here: (1) the id, (2) the latitude, (3) the longitude, and (4) the URL of the photo; however, the search function returns many more attributes for each photo.

We store the information of interest in a dictionary called photo_information. The keys are the attributes of interest and the values are lists, so that the same index in every list holds the details of one photo.
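A minimal sketch of this step, using only the standard library, might look like this. The XML string is a hand-written stand-in for a real response (the ids, coordinates, and URLs are made up), and the function name extract_photo_information is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for a Flickr response; all values are made up.
SAMPLE_RESPONSE = """
<rsp stat="ok">
  <photos page="1" pages="1" perpage="250" total="2">
    <photo id="1001" latitude="39.4745" longitude="-0.3768" url_n="https://example.com/1001_n.jpg" />
    <photo id="1002" latitude="39.4550" longitude="-0.3505" url_n="https://example.com/1002_n.jpg" />
  </photos>
</rsp>
"""

ATTRIBUTES = ("id", "latitude", "longitude", "url_n")


def extract_photo_information(root):
    """Collect the attributes of interest from every photo tag."""
    if root.get("stat") != "ok":
        raise RuntimeError("the call did not succeed")
    photo_information = {name: [] for name in ATTRIBUTES}
    for photo in root.iter("photo"):
        for name in ATTRIBUTES:
            # XML attributes are always strings; a missing one yields None.
            photo_information[name].append(photo.get(name))
    return photo_information


photo_information = extract_photo_information(ET.fromstring(SAMPLE_RESPONSE.strip()))
print(photo_information["id"])
```

When the results span several pages, the same function can be called once per page and the lists extended accordingly.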

Finally, we can easily convert this dictionary into a pandas data frame using the pandas.DataFrame constructor.
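For illustration, the conversion can be sketched with a tiny made-up dictionary that has the same shape as photo_information; the values below are not real Flickr data.

```python
import pandas as pd

# Made-up values with the same structure as photo_information.
photo_information = {
    "id": ["1001", "1002"],
    "latitude": ["39.4745", "39.4550"],
    "longitude": ["-0.3768", "-0.3505"],
    "url_n": ["https://example.com/1001_n.jpg", "https://example.com/1002_n.jpg"],
}

# Each key becomes a column and each list index becomes a row.
df = pd.DataFrame(photo_information)
print(df.head())
```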

Data Cleaning

After retrieving the information of interest, the data frame obtained contains only four columns: (1) id, (2) latitude, (3) longitude, and (4) url_n. It is a really simple data set; however, we cannot use it directly to cluster the data. We first need to clean it.

The data set does not contain missing values; however, the latitude and longitude columns do not have a numeric data type. We need to convert these columns into floats to be able to use them later in the DBSCAN algorithm.

After correcting the data types, we check whether the data frame includes duplicated photos. Every photo stored on Flickr has its own unique id, so two different pictures cannot share the same id. In this data set, most of the photos are duplicated, so we need to eliminate the repeated entries from the data frame.

Once the duplicated entries are removed, we eliminate the id and url_n columns from the data set, since they are no longer needed.
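The three cleaning steps can be sketched on a small made-up data frame as follows. The values are invented for illustration, and the exact calls in the notebook may differ.

```python
import pandas as pd

# Made-up rows; the photo with id "1001" appears twice on purpose.
df = pd.DataFrame({
    "id": ["1001", "1002", "1001", "1003"],
    "latitude": ["39.4745", "39.4550", "39.4745", "39.4780"],
    "longitude": ["-0.3768", "-0.3505", "-0.3768", "-0.3830"],
    "url_n": ["a.jpg", "b.jpg", "a.jpg", "c.jpg"],
})

# 1. Convert the coordinate columns from text to floats.
df = df.astype({"latitude": float, "longitude": float})

# 2. Count and remove photos that share the same id.
n_duplicates = int(df.duplicated(subset="id").sum())
df = df.drop_duplicates(subset="id").reset_index(drop=True)

# 3. Keep only the columns needed for clustering.
coordinates = df.drop(columns=["id", "url_n"])
print(n_duplicates, coordinates.shape)
```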

Now, we can use the data to obtain the clusters that indicate the most popular spots of the city.

DBSCAN algorithm

Theory

The DBSCAN (Density-Based Spatial Clustering of Applications with Noise) method is a density-based clustering algorithm used to separate high-density from low-density regions.

This algorithm is based on two hyperparameters:

The radius (eps): The maximum distance between two samples to be considered as neighbors.
The minimum number of points (MinPts): The number of samples in the neighborhood to consider an observation a core point.

Based on these hyperparameters, the DBSCAN algorithm classifies every observation in the data set as a core, border, or outlier point, according to the following rules:

Core point: A data point that has at least MinPts observations within its radius eps.
Border point: A data point that has fewer than MinPts points within its radius eps but lies within the radius eps of a core point.
Outlier point: A data point that is neither a core point nor a border point.

Core, border, and outlier points — Image created by the author

Then, the points are assigned to clusters based on their types. Each cluster contains at least one core point and all border points that are reachable from it.
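To make the three point types concrete, here is a small pure-Python sketch that labels each point according to the rules above. It is an illustration rather than a full DBSCAN implementation (it does not assign cluster ids), and it assumes that a point counts as its own neighbor, which is the convention scikit-learn uses for min_samples.

```python
from math import dist


def classify_points(points, eps, min_pts):
    """Label each point as 'core', 'border', or 'outlier'."""
    # Indices of all points within eps of each point (the point itself included).
    neighbors = [
        [j for j, q in enumerate(points) if dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_pts for n in neighbors]

    kinds = []
    for i, n in enumerate(neighbors):
        if is_core[i]:
            kinds.append("core")
        elif any(is_core[j] for j in n):
            kinds.append("border")   # not dense itself, but next to a core point
        else:
            kinds.append("outlier")
    return kinds


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
print(classify_points(points, eps=1.0, min_pts=3))
# ['core', 'core', 'core', 'core', 'border', 'outlier']
```

The four corners of the unit square each have at least three points within distance 1, so they are core points. The point (2, 1) only has itself and (1, 1) nearby, which makes it a border point, and (5, 5) has no neighbors at all.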

Advantages and disadvantages

The DBSCAN algorithm offers multiple advantages over other clustering algorithms. Its major strength is that it can find clusters of any shape; the clusters do not have to be blob-shaped. Additionally, it is not necessary to fix the number of clusters before executing the algorithm, as we have to do with the K-means method. Furthermore, DBSCAN is capable of detecting noise in the data set, in contrast to partitioning algorithms such as K-means, which assign every point to a cluster. With DBSCAN, the points located in regions of low density are not assigned to any cluster.

However, the DBSCAN method also has some disadvantages. The main challenge is finding the right combination of the two hyperparameters (eps and MinPts). Their choice is largely a matter of judgment and strongly affects the results obtained with the algorithm. A common practice is to test different sets of hyperparameters and choose the one that produces acceptable results, taking into consideration the number of clusters and the outliers generated.

Advantages and disadvantages of the DBSCAN algorithm — Image created by the author

Visualization of the observations with a scatter plot

The following plot shows the location of the photographs taken in Valencia in 2019 (available on the Flickr platform). The x-axis represents the longitude where the photo was taken, and the y-axis represents the latitude. As you can see, there are locations where clearly more photos were taken (high-density areas). The next step consists of using the DBSCAN algorithm to recognize these locations.

Implementation of the DBSCAN algorithm with Scikit-Learn

To implement the DBSCAN algorithm, we need first to instantiate a DBSCAN model, which can be imported from sklearn.cluster. As you can see below, the hyperparameters chosen for this particular dataset are: (1) eps= , and (2) min_samples=. These parameters were defined by trial and error. Notice that before applying the DBSCAN algorithm, we have normalized the data points with the MinMaxScaler class, so that all attributes (latitude and longitude) have the same range [0, 1].
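A rough sketch of this step on synthetic data is shown below. The coordinates are randomly generated (two dense spots plus scattered noise), and the eps and min_samples values are placeholders picked for this synthetic example; they are not the values used for the Valencia data set.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(seed=0)

# Synthetic (latitude, longitude) pairs: two dense spots plus scattered noise.
spot_a = rng.normal(loc=[39.455, -0.351], scale=0.001, size=(100, 2))
spot_b = rng.normal(loc=[39.475, -0.376], scale=0.001, size=(100, 2))
noise = rng.uniform(low=[39.40, -0.45], high=[39.52, -0.30], size=(20, 2))
coordinates = np.vstack([spot_a, spot_b, noise])

# Rescale latitude and longitude so that both lie in [0, 1].
scaled = MinMaxScaler().fit_transform(coordinates)

# Placeholder hyperparameters (assumption), chosen for this synthetic data only.
model = DBSCAN(eps=0.05, min_samples=10)
labels = model.fit_predict(scaled)

n_clusters = len(set(labels) - {-1})
print(n_clusters, "clusters found")
```

Because eps is measured in the scaled space, a value tuned on one data set does not transfer directly to another whose bounding box has a different size.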

We can determine the number of clusters by looking at the unique values of the labels. As you can see, some observations have an index equal to -1, meaning those observations are detected as outliers by the algorithm.
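Counting clusters and outliers from the labels can be sketched as follows. The helper summarize_labels and the label list are made up for illustration.

```python
from collections import Counter


def summarize_labels(labels):
    """Count clusters, noise points, and cluster sizes from DBSCAN labels."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)          # label -1 marks outliers
    return {"n_clusters": len(counts), "n_noise": n_noise, "sizes": dict(counts)}


# Made-up labels: three clusters (0, 1, 2) and two outliers.
labels = [0, 0, 1, -1, 1, 1, -1, 2]
summary = summarize_labels(labels)
print(summary)
```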

Lastly, we visualize the clusters excluding the noise (observations associated with a label equal to -1). Additionally, we plot the label of each group at its cluster center.
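DBSCAN itself does not return cluster centers, so they have to be computed from the labeled points. One simple option, assumed here for illustration, is to take the mean latitude and longitude of the members of each cluster after dropping the noise. The data below is made up.

```python
import pandas as pd

# Made-up labeled points; the last one is noise (label -1).
df = pd.DataFrame({
    "latitude": [39.4550, 39.4552, 39.4745, 39.4747, 39.5000],
    "longitude": [-0.3505, -0.3507, -0.3768, -0.3770, -0.4200],
    "cluster": [0, 0, 1, 1, -1],
})

# Exclude the noise, then average the coordinates of each cluster.
clustered = df[df["cluster"] != -1]
centers = clustered.groupby("cluster")[["latitude", "longitude"]].mean()
print(centers)
```

For clusters with an elongated or curved shape, the mean may fall outside the dense region, so the centers are best read as approximate anchors for the labels.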

Visualizing the centers of the clusters on an interactive map

The next step is to visualize the cluster centers on top of the map of Valencia. This approach will allow us to easily make associations between the centers of the groups and the locations in the city.

First, we construct a Folium map with a location and a zoom level. This will produce an empty map of the given location (in this case Valencia). Then, we render markers at the location of the cluster centers with the Marker function.

Associations of the clusters — Most important spots in Valencia

We can recognize the following points of interest by visualizing the center of the groups on top of the map of Valencia:

Cluster 0: The City of Arts and Sciences
Cluster 1: The North Railway Station
Cluster 2: Burjassot Avenue
Cluster 3: Town Hall Square
Cluster 4: Benicalap Park
Cluster 5: The historic center of the city
Cluster 6: Bioparc (The zoo)
Cluster 7: Parroquia San Josemaría Escrivá
Cluster 8: El Carmen Neighborhood
Cluster 9: Sports Center (Alboraya)
Cluster 10: Museum of Fine Arts of València
Cluster 11: Institut Valencià d’Art Modern
Cluster 12: Jardines de Monforte
Cluster 13: Parc de Capçalera
Cluster 14: L’Oceanogràfic
Cluster 15: Resort Las Arenas
Cluster 16: Malvarrosa Beach

My favorite places 😍

Cluster 5: The historic center of the city

The historic center of Valencia is without a doubt the most charming part of the city. Today, it is the heart of leisure and trade in modern Valencia, packed with shops, restaurants, and bars. The Central Market, the Cathedral, the Silk Exchange, and the El Carmen neighborhood are among the most important sights of Valencia's Old Town.

Cluster 0: The City of Arts and Sciences

The City of Arts and Sciences is a cultural and scientific complex designed by the Valencian architect Santiago Calatrava. It is situated at the end of the Turia riverbed and is one of the most visited spots in Valencia. Officially inaugurated in April 1998, the complex includes L’Hemisfèric (1998), El Museu de les Ciències Príncipe Felipe (2000), L’Umbracle (2001), L’Oceanogràfic (2003), El Palau de les Arts Reina Sofia (2005), and L’Àgora (2009). It offers a wide range of cultural activities and events throughout the year.

Alternative visualization — Heat map of the photographs

The last step of the analysis consists of visualizing the location of the photographs (latitude and longitude) using a heat map. Heat maps use color to display how a quantity changes across a region. In this case, we use Folium again to create a heat map that is overlaid on the map of Valencia.

The heat map shows that the historic center, the City of Arts and Sciences, the port, and the zoo are the most popular locations in Valencia.

Amanda Iglesias