LOF to find outlying cities (R code) (Part 1 of 2)
Introduction
Canada has a concentrarion of relatively big cities in the south and sparsely located cities in the north. But, which cities show the most distinct behavior in terms of geographical location and population? In this post we will use LOF (Local Outlier Factor), a machine learning algorithm that uses neighborhoods, distances and densities to find outlying points. We will cover the main data analysis steps needed to find the most outlying cities: first data pre-processing , next an exploratory analysis, then the algorithm LOF, results section and finally the conclusion. As always remember that the objective of this blog is to explain the algorithm while using fun datasets(the “fun” factor is reader dependent!).Data Pre-processing
The dataset comes from http://www.tageo.com/index-e-ca-cities-CA.htm. I am using this data only because it was ready to use and to download, so i apologize if you find some innacuracies in the data, but again we are not here for the data but to understand the algorithm.This dataset has 5 variables, but we will not use the first one (ranking of the cities), then the remaining 4 variables are:
- City (Factor). Name of the city.
- Population (Integer). Population of the city in year 2000.
- Latitude (Numeric). Geographical latitude.
- Longitud (Numeric). Geographical Longitude.
Exploratory analysis
First let’s explore the dataset with the str() function:str(data)
## 'data.frame': 300 obs. of 4 variables:
## $ City : Factor w/ 300 levels "Abbotsford","Acton",..: 270 159 277 36 179 74 105 204 295 128 ...
## $ Population2000: int 4551800 3256300 1836500 938300 863000 812400 645100 643200 637900 403800 ...
## $ Latitude : num 43.6 45.5 49.3 51 45.4 ...
## $ Longitude : num -79.4 -73.6 -123.1 -114.1 -75.7 ...
Now, let’s plot a map of Canada and its main cities. First, we can obtain the data to outline Canada with the function getData() from the raster library, the data comes from http://gadm.org/. We can plot the map with the plot() function, this will take a minute or two depending on your computer. With the map plotted we can add the main cities of Canada using the points() function with the latitude and longitude as its x and y arguments. It is worth to mention that can_data (Map of Canada) is only used to have a visual aid that help us undersand the location of the cities, but obviously K-means will not use this information.library(raster)
can_map<-getData('GADM', country="CAN", level=1) # provinces
plot(can_map,sub="Figure 1. Map of Canada with cities>=5000 in year 2000")
points(data$Longitud,data$Latitud,pch=19,col="blue",cex=.5)
On observation of Figure 1, it is a clear a tendency to live in the south, maybe because we want to escape the coldest temperatures in the north (the winter is coming!), also the cities tend to form clusters with some some outlying cities in the north.
The algorithm
Cluster analysis comprehend a class of algorithms that tries to classify similar observations into the same group. The main types of cluster algorithms are:- Centroid based.
- Density based.
- Connectivity based.
- Breunig, M. M., H.-P. Kriegel, R. T. Ng, #246 and r. Sander (2000). “LOF: identifying density-based local outliers.” SIGMOD Rec. 29(2): 93-104.