Finding Outliers and Surprising Data Points

Finding data points that don't fit the rest of the population is a key part of understanding your data. Whether these anomalies occur naturally or result from inaccurate data entry, EmcienScan pinpoints them and gives you the information you need to locate them in your source data. EmcienScan defines outliers using the following criteria:

Numerical Outlier: Any number that is 3 or more standard deviations away from the mean

Categorical Outlier: A value that either A) has a frequency 3 or more standard deviations away from the mean frequency for values in that column, or B) has a string length 6 or more standard deviations away from the mean string length for values in that column
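The two definitions above can be sketched in a few lines of Python. This is only an illustration of the stated criteria, not EmcienScan's implementation: the function names are hypothetical, the population standard deviation is assumed, and the categorical rules are assumed to be evaluated over the distinct values in a column.

```python
from collections import Counter
from statistics import mean, pstdev


def far_from_mean(measures, k):
    """Return the keys whose measure is k or more standard deviations from the mean."""
    data = list(measures.values())
    mu, sigma = mean(data), pstdev(data)  # population std dev (an assumption)
    if sigma == 0:
        return set()  # every measure is identical, so nothing stands out
    return {key for key, m in measures.items() if abs(m - mu) >= k * sigma}


def numerical_outliers(values, k=3.0):
    """Numbers that are k or more standard deviations from the column mean."""
    indexed = dict(enumerate(values))
    return [indexed[i] for i in sorted(far_from_mean(indexed, k))]


def categorical_outliers(values, freq_k=3.0, length_k=6.0):
    """Distinct values flagged by rule A (frequency) or rule B (string length)."""
    counts = Counter(values)
    by_frequency = far_from_mean(dict(counts), freq_k)
    by_length = far_from_mean({v: len(v) for v in counts}, length_k)
    return by_frequency | by_length
```

For example, numerical_outliers([10] * 20 + [100]) flags only the value 100, while a column in which every value is identical has a standard deviation of zero and therefore no outliers.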

To find the surprising data points in a file, begin by scanning it. In the screenshots for this example, we use public Medicare data (found here) that we uploaded into a MySQL database.

From the home page, we can immediately tell how many columns in the file contain outliers by viewing the outliers bar in the rightmost column. A full bar means that every column within the data set contains outliers.

Clicking the scan will take you to the data overview page. Here you can order columns by the number of outliers they contain, and filter the view to show only columns with numerical or categorical outliers.

To filter the display, select the Data Type from the list available and the screen will update. The selected data type will be highlighted, and the number of columns shown will change accordingly.

Maximum Outliers: The maximum number of outliers that a column can contain is determined using Chebyshev's Inequality, which states that no more than 1/k² of a column's values can be k or more standard deviations away from the mean. Since EmcienScan defines outliers as being 3 or more standard deviations from the mean, no more than 1/3² = 1/9 ≈ 11% of the column can be outlying values.
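As a quick check of that arithmetic, the bound can be computed directly. The helper names below are hypothetical and are not part of EmcienScan. They simply evaluate Chebyshev's 1/k² limit for a given threshold and row count.

```python
def max_outlier_fraction(k=3.0):
    """Chebyshev upper bound on the share of values k or more std devs from the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(1.0, 1.0 / (k * k))


def max_outlier_count(n_rows, k=3.0):
    """Largest whole number of rows that can satisfy the bound: floor(n / k^2)."""
    if k <= 0:
        raise ValueError("k must be positive")
    return min(n_rows, int(n_rows // (k * k)))


print(f"{max_outlier_fraction(3):.1%}")  # about 11% of the column
print(max_outlier_count(1000))           # at most 111 rows out of 1000
```

The bound holds for any distribution, so it is a ceiling rather than an estimate. Most real columns contain far fewer outliers than this.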

To investigate the outliers within a specific column, click on the column and scroll down to the data profile at the bottom of the page. If desired, the arrow in the top right can be clicked to collapse the predictability section. Numerical outliers will be outlined in orange in the histogram.

Categorical outliers may or may not be shown as orange within the histogram because their values may be tied to string length and not frequency.

As we can see in this example, outlier data points are those that are less than -27.3. To find these in your source data repository, click 'Outliers SQL' for sample SQL queries that find similar data points.
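A query of this kind might look like the sketch below. It is not the text that 'Outliers SQL' produces. The table name claims and the column name payment_delta are hypothetical, and Python's built-in sqlite3 module stands in for the MySQL database so that the example runs on its own. Only the -27.3 threshold comes from the example above.

```python
import sqlite3


def outlier_query(table, column):
    """Build a parameterized query for rows below a lower outlier threshold."""
    # Identifiers cannot be bound as parameters, so validate them first.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    return f"SELECT * FROM {table} WHERE {column} < ?"


# Hypothetical table and sample rows, used only to exercise the query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (id INTEGER PRIMARY KEY, payment_delta REAL)")
conn.executemany(
    "INSERT INTO claims (payment_delta) VALUES (?)",
    [(-40.0,), (-27.3,), (-5.0,), (12.5,)],
)

# The threshold is passed as a bound parameter rather than pasted into the SQL.
rows = conn.execute(outlier_query("claims", "payment_delta"), (-27.3,)).fetchall()
print(rows)  # only the row with -40.0 is strictly below the threshold
```

Most Python drivers for MySQL use %s rather than ? as the parameter placeholder, so the query string would need that one change when run against the real source database.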

The definitions for what constitutes an outlier for every column are found in the outliers .csv file, which can be downloaded from the menu on the right side of the page. This provides the information to query the source database to find all of the outliers within the data set.

This information is also available via the API. To learn how to continually monitor your data set for outliers, please see our article on Persistent Data Monitoring.