Course | Offered by | Taken date | Difficulty | Commitment | Type | Description |
---|---|---|---|---|---|---|
Introduction to Data Science in Python | Coursera. University of Michigan | January 2017 | ★★★☆☆ | 10 hrs per week | Theory and practice | |
Data Science in Real Life | Coursera. Johns Hopkins University | March 2017 | ★☆☆☆☆ | 3 hrs per week | Theory and practice | |
A Crash Course in Data Science | Coursera. Johns Hopkins University | April 2017 | ★☆☆☆☆ | 3 hrs per week | Theory | A really introductory course, with a lot of definitions of important concepts in data science. Any student should be able to pass this course |
Data-driven astronomy | Coursera. The University of Sydney | April 2017 | ★★★☆☆ | 6 hrs per week | Theory and practice | An extremely fun course, really intuitive. Perfect for anyone interested in astronomical data |
Data processing using Python | Coursera. Nanjing University | December 2017 | ★☆☆☆☆ | 4 hrs per week | Practice | An easy course. However, it is in Chinese and reading the subtitles is extremely tiring |
Python data representations | Coursera. Rice University | December 2017 | ★☆☆☆☆ | 4-5 hrs per week | Practice | A very interesting course; the lectures are great. Very good professors. |
Introduction to Structured Query Language (SQL) | Coursera. University of Michigan | December 2017 | ★☆☆☆☆ | 2-3 hrs per week | Practice | One of the best professors. A simple but great introductory course to SQL |
SQL for Data Science | Coursera. University of California, Davis | January 2018 | ★☆☆☆☆ | 4 hrs per week | Theory and practice | A really great way to learn SQL. It is an introductory course, but it covers more aspects than a similar course from U. Michigan. If you are completely new to SQL, I recommend first taking the one from U. Michigan |
An Intuitive Introduction to Probability | Coursera. University of Zurich | January 2018 | ★☆☆☆☆ | 3-4 hrs per week | Theory and practice | A very clear introduction to probability. The professors introduce the theory with interesting examples. |
Python Data Analysis | Coursera. Rice University | February 2018 | ★★☆☆☆ | 4-5 hrs per week | Practice | This is an easy course if you are familiar with nested dictionaries in Python; otherwise it will be difficult at the beginning of the course |
Python Data Visualization | Coursera. Rice University | March 2018 | ★★☆☆☆ | 4-5 hrs per week | Practice | This is the second part of the course Python Data Analysis; thus, if you are comfortable with dictionaries you should be fine |

# Real life Outlier detection

A blog dedicated to explaining the most basic machine learning algorithms using fun and interesting real-world data, with a focus on outlier detection (using the data analysis software R)

## Wednesday, December 13, 2017

### Personal evaluation. Data science MOOCs

## Saturday, June 4, 2016

### Chasing outliers

#### *JOSE R. PASILLAS D.*

#### *May 17, 2016*

One of the simplest types of outliers is the point outlier. The most basic example of this type is a value found at the extremes of a series of points. In R it is really simple to identify these outliers using the function *outlier* in the library outliers. However, for now we will identify them without a specific library; instead we will use the basic and simplistic rule of 3 standard deviations (\({\sigma}\)) above the mean (*Figure 1*). Any observation lying beyond 3\({\sigma}\) is then declared an outlier.
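Written out, the rule flags an observation \(x_i\) when it exceeds the mean by more than three standard deviations. The symbols \(\bar{x}\) and \(s\) are notation introduced here for the sample estimates and are not taken from the original post:

\[
x_i > \bar{x} + 3s, \qquad \bar{x} = \frac{1}{n}\sum_{j=1}^{n} x_j, \qquad s = \sqrt{\frac{1}{n-1}\sum_{j=1}^{n} \left(x_j - \bar{x}\right)^2}
\]

R's `mean()` and `sd()` compute exactly these two quantities; `sd()` uses the \(n-1\) denominator shown above.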

```
## [1] 0.72 0.88 1.78 1.04 1.06 1.86 1.23 0.37 0.66 0.78 1.61 1.18 1.20 1.06
## [15] 0.72 1.89 1.25 0.02 1.35 0.76 0.47 0.89 0.49 0.64 0.69 0.16 1.42 1.08
## [29] 0.43 1.63 7.21 6.85
```

In this small and, let's face it, unrealistic example the outliers 7.21 and 6.85 are located in a single-dimensional dataset, so it is pretty easy to find them visually with the aid of a boxplot. On examination of *Figure 2* it is evident that there is a pair of outliers (blue circles above the upper whisker) in the data.
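The snippet below is a minimal sketch of that visual check, not necessarily the code behind the original figure. It assumes the values printed above are stored in a vector named `small_example` (the name used in the post's own code), and relies on base R's `boxplot()` and `boxplot.stats()`. Note that a boxplot flags points lying more than 1.5 times the interquartile range beyond the hinges, which is its own convention and not the 3\({\sigma}\) rule.

```r
# Values copied from the printed output above
small_example <- c(0.72, 0.88, 1.78, 1.04, 1.06, 1.86, 1.23, 0.37,
                   0.66, 0.78, 1.61, 1.18, 1.20, 1.06, 0.72, 1.89,
                   1.25, 0.02, 1.35, 0.76, 0.47, 0.89, 0.49, 0.64,
                   0.69, 0.16, 1.42, 1.08, 0.43, 1.63, 7.21, 6.85)

# Draw the boxplot; points beyond the whiskers are drawn individually
boxplot(small_example, main = "small_example")

# Extract the same points numerically instead of reading them off the plot
boxplot_outliers <- boxplot.stats(small_example)$out
boxplot_outliers
```

For this vector both approaches single out the same two values, but on other data the boxplot convention and the 3\({\sigma}\) rule can disagree.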

With a single line of code it is possible to find them using the 3\({\sigma}\) above the mean rule.

```
# Keep the values lying more than 3 standard deviations above the mean
extreme_values <- small_example[small_example > mean(small_example) + 3 * sd(small_example)]
extreme_values
```

`## [1] 7.21 6.85`

Because the outliers 7.21 and 6.85 sit in a single-dimensional dataset, it was really easy to find them using the 3\({\sigma}\) rule. However, in real-world datasets outliers usually have to be identified in more complex scenarios, with multiple interacting features, huge dimensionalities, etc.

Now we will work in a more complex and realistic scenario where the outliers are located in more than a single dimension. For this purpose we’ve generated a two-dimensional dataset. This artificially generated dataset is also far from real-world scenarios, where the dimensionality can reach millions of features, as in genomic or semantic analysis. However, despite its limitations, this small dataset will suffice to explain a different way to search for outliers.

In *Figure 3* we have plotted the dataset. Clearly there are two clusters, \(C_{1}\) and \(C_{2}\): the former has a very high density, as its observations or points are very close to one another; the latter, whose points are more sparsely distributed, has a lower density. A problem with this type of scenario, where clusters have different densities, is that a simple algorithm like k-NN is not reliable. k-NN could easily find the outlier \(O_{2}\); however, because of the different cluster densities it may struggle with outlier \(O_{1}\), possibly missing it and classifying it instead as a normal or inlier observation.
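The following sketch illustrates the problem on made-up data. Everything in it is an assumption for illustration: the `toy` dataset (two grid-shaped clusters plus two planted points standing in for \(O_{1}\) and \(O_{2}\)) is not the dataset plotted in the post, and `knn_distance` is a hypothetical helper that scores each point by the distance to its k-th nearest neighbor, one common way of using k-NN for outlier detection.

```r
# Hypothetical toy data (NOT the post's dataset): two grid-shaped clusters
dense  <- expand.grid(x = 0:4, y = 0:4)              # C1: 25 points, spacing 1
sparse <- expand.grid(x = seq(100, 140, by = 10),
                      y = seq(100, 140, by = 10))    # C2: 25 points, spacing 10
toy <- rbind(dense, sparse,
             data.frame(x = 10, y = 10),             # row 51: O1, just outside C1
             data.frame(x = 60, y = 160))            # row 52: O2, far from both

# k-NN distance score: distance from each point to its k-th nearest neighbor
knn_distance <- function(data, k = 5) {
  d <- as.matrix(dist(as.matrix(data)))  # pairwise Euclidean distances
  diag(d) <- Inf                         # a point is not its own neighbor
  unname(apply(d, 1, function(row) sort(row)[k]))
}

knn_score <- knn_distance(toy, k = 5)

which.max(knn_score)                    # row of the largest score (O2)
knn_score[51] < min(knn_score[26:50])   # is O1 scored below every point of C2?
```

In this toy setup O2 receives the largest score, while O1 is scored lower than every ordinary point of the sparse cluster, so any single global cutoff that catches O1 would also flag all of C2.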

In this type of scenario a better approach is to use a variation of k-NN known as Local Outlier Factor (LOF). LOF computes outlier scores by taking into account the different densities of each cluster and the distance to each neighbor. We can compute LOF outlier scores for each observation in the dataset using the function *lofactor()* in the DMwR library. The LOF function (Figure 3) takes only two parameters: the dataset and the number of neighbors *k*. However, finding the best *k* depends on factors such as the size of the available sample, the size of the clusters, etc. These factors are completely application dependent and therefore outside the scope of this post. We will arbitrarily fix *k* to a value of 5, but different values of *k* should be tested to find the best setting for the specific dataset under study.
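For reference, the usual definition of LOF (Breunig et al., 2000) can be written as follows. The notation is introduced here and does not come from the post: \(d(p, o)\) is the distance between points \(p\) and \(o\), \(d_k(p)\) is the distance from \(p\) to its k-th nearest neighbor, and \(N_k(p)\) is the set of k nearest neighbors of \(p\). I am assuming, without having checked its source, that *lofactor()* follows this definition.

\[
\mathrm{reach\text{-}dist}_k(p, o) = \max\{\, d_k(o),\; d(p, o) \,\}
\]

\[
\mathrm{lrd}_k(p) = \left( \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \mathrm{reach\text{-}dist}_k(p, o) \right)^{-1}
\]

\[
\mathrm{LOF}_k(p) = \frac{1}{|N_k(p)|} \sum_{o \in N_k(p)} \frac{\mathrm{lrd}_k(o)}{\mathrm{lrd}_k(p)}
\]

A score close to 1 means that \(p\) is about as densely surrounded as its neighbors; a score well above 1 means that \(p\) lies in a sparser region than its neighbors do, which is what marks it as a local outlier.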

Let’s go over each component of the function *lofactor()* (*Figure 4*):

- A variable to store the outlier scores (one for each observation).
- The function that actually computes the outlier scores.
- The data.
- The number of neighbors *k*.

With the R package *DMwR*, computing outlier scores with LOF takes a single line of code, which produces a score for each observation in the dataset: the higher the score, the more strongly LOF considers that observation to be an outlier, and the lower the score, the more strongly LOF considers it to be a normal observation.

```
library(DMwR)
# LOF score for every observation in data01, using k = 5 neighbors
outlier.scores <- lofactor(data01, k = 5)
```
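For readers who want to see what such a function does internally, here is a minimal base-R sketch of LOF. It is not DMwR's code. `lof_scores` is a hypothetical helper and `toy` is made-up data (two grid-shaped clusters of different density plus two planted outliers), not the post's `data01`. The sketch assumes Euclidean distance, breaks distance ties by row order, and does not handle duplicated points.

```r
# Hypothetical toy data (NOT data01): a dense and a sparse grid-shaped cluster
dense  <- expand.grid(x = 0:4, y = 0:4)              # rows 1-25, spacing 1
sparse <- expand.grid(x = seq(100, 140, by = 10),
                      y = seq(100, 140, by = 10))    # rows 26-50, spacing 10
toy <- rbind(dense, sparse,
             data.frame(x = 10, y = 10),             # row 51: near the dense cluster
             data.frame(x = 60, y = 160))            # row 52: far from both clusters

lof_scores <- function(data, k = 5) {
  x <- as.matrix(data)
  n <- nrow(x)
  stopifnot(k >= 1, k < n)
  d <- as.matrix(dist(x))   # pairwise Euclidean distances
  diag(d) <- Inf            # a point is not its own neighbor

  # Row indices of the k nearest neighbors of every point
  nn <- lapply(seq_len(n), function(i) order(d[i, ])[seq_len(k)])

  # Distance from every point to its k-th nearest neighbor
  k_dist <- vapply(seq_len(n), function(i) d[i, nn[[i]][k]], numeric(1))

  # Local reachability density: inverse of the mean reachability distance
  lrd <- vapply(seq_len(n), function(i) {
    o <- nn[[i]]
    1 / mean(pmax(k_dist[o], d[i, o]))
  }, numeric(1))

  # LOF: mean density of the neighbors relative to the point's own density
  vapply(seq_len(n), function(i) mean(lrd[nn[[i]]]) / lrd[i], numeric(1))
}

toy_scores <- lof_scores(toy, k = 5)
summary(toy_scores)
order(toy_scores, decreasing = TRUE)[1:2]  # rows with the two highest scores
```

On this toy data the two planted points receive the highest scores, while the grid points of both clusters stay close to 1 even though their absolute spacing differs by a factor of ten. That scale independence is the property that lets LOF cope with clusters of different density.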

Looking at the summary and density plot (*Figure 5*) of the scores generated by LOF, it is clear that the distribution of scores is right skewed, with a mean of about 1.2 and a median of about 1.0. Indeed, in this simplistic example LOF correctly assigned the highest scores to the two outlying observations: an outlier score of 9.14 to \(O_{2}\) and 3.71 to \(O_{1}\).

`summary(outlier.scores)`

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.9159 0.9901 1.0440 1.2290 1.1710 9.1460
```

`plot(density(outlier.scores),sub="Figure 5. Outlier scores density")`

It is worth noting that even an algorithm based on relative (or local) densities like LOF can have some difficulty differentiating between normal instances and outliers like \(O_{1}\). Therefore, when tuning the algorithm's parameters, data visualization, exploratory analysis and expert knowledge are key to obtaining the best possible output, and several iterations of the algorithm are usually needed before finding the sweet spot.

In this post we have explained two iconic approaches to outlier detection: the 3\({\sigma}\) rule (mainly used for extreme value detection) and Local Outlier Factor (LOF). LOF is a very useful algorithm that can be used in real-world scenarios, mainly due to its capacity to adapt to different local densities. However, many more algorithms for outlier detection exist in the literature [2], each of them based on different assumptions about what constitutes an outlier, and their use is completely application dependent. Nevertheless, these algorithms have a common characteristic: they are implicitly designed to find those observations exhibiting aberrant behavior.
