Glossary of Terms

This article features a collection of terms and phrases used within EmcienScan.

Connected Strength is measured based on a type of correlation between every pair of columns. Because data can be both numeric and categorical, EmcienScan uses mathematics involving information theory to measure and rank the data.

To see how results from EmcienScan compare to results from other statistical or analysis software, try computing a correlation matrix using only numerical data and note the similarity to your results in EmcienScan. However, unlike most other statistical software, EmcienScan can compute the correlations for non-numeric data as well.
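
As a rough illustration of how an information-theoretic measure can score the relationship between two non-numeric columns, the sketch below computes a normalized mutual information value in plain Python. This is not EmcienScan's actual formula, which the article does not document. The function names, the choice to normalize by the smaller of the two column entropies, and the sample columns are all assumptions made for the example.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a sequence of categorical values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def normalized_mutual_information(xs, ys):
    """Mutual information of two columns, scaled to roughly [0, 1].

    Assumption: we divide by the smaller column entropy; other
    normalizations exist and EmcienScan may use a different one.
    """
    h_x, h_y = entropy(xs), entropy(ys)
    h_xy = entropy(list(zip(xs, ys)))  # joint entropy of the value pairs
    mutual_information = h_x + h_y - h_xy
    denom = min(h_x, h_y)
    return mutual_information / denom if denom > 0 else 0.0

# Hypothetical sample columns
color = ["red", "red", "blue", "blue", "green", "green"]
shape = ["circle", "circle", "square", "square", "star", "star"]
noise = ["a", "b", "a", "b", "a", "b"]

print(normalized_mutual_information(color, shape))  # close to 1: shape follows color
print(normalized_mutual_information(color, noise))  # close to 0: no relationship
```

Unlike a Pearson correlation, this kind of score needs only value counts, so it applies equally to text categories and to binned numbers.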

The amount of green filling in the dots indicates the relative connected strength or quality of a dataset or field. For example:  
Five full green dots indicate that a specific field or many of the fields in the dataset are very connected. The dataset or specific field contains a large amount of information allowing the user to discover relationships between fields and the key drivers for understanding a desired outcome.

Five empty gray dots indicate that a specific field or all of the fields in the dataset are not connected at all. Knowing that data is disconnected is just as valuable as knowing the key drivers within a dataset, and provides a measurement of the quality of data before analysis.

One dot outlined in green indicates that the selected field or the majority of fields within the dataset are slightly predictable.

Most fields or datasets will fall somewhere between the two extremes. This is fairly representative of natural data, which contains natural patterns as well as disparate data points.

EmcienScan uses badges to illustrate the relationships that some columns have with other fields.

Redundant Column Set (RC) is displayed when the value in one field can always be determined by the value of another field in the same set.

Perfect Determiner (PD) appears when values in the selected field can always be determined by the values in this field.

Unique Values (UV) indicates fields that should not be considered for most analyses, as every value is unique, typically meaning that the field is a serial number or counter.

Mostly Unique Values (MU) indicates that the column consists of mostly unique values and may not be helpful for future analyses.
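
The conditions behind these badges can be sketched with two small checks: whether one column always determines another, and what share of a column's values are distinct. The code below only illustrates the definitions above and is not EmcienScan's implementation. The helper names and sample data are hypothetical, and the cutoff EmcienScan uses for "mostly unique" is not stated in this article, so none is assumed here.

```python
def determines(a, b):
    """True if every value in column a always maps to the same value in column b."""
    seen = {}
    for x, y in zip(a, b):
        # setdefault stores y the first time x appears, then returns the stored value
        if seen.setdefault(x, y) != y:
            return False
    return True

def unique_ratio(column):
    """Fraction of entries in the column that are distinct (1.0 means all unique)."""
    return len(set(column)) / len(column)

# Hypothetical sample columns
zip_code = ["30301", "30301", "30302", "10001", "94105"]
state = ["GA", "GA", "GA", "NY", "CA"]
order_id = [1001, 1002, 1003, 1004, 1005]

print(determines(zip_code, state))  # True: each zip code maps to exactly one state
print(determines(state, zip_code))  # False: "GA" maps to two different zip codes
print(unique_ratio(order_id))       # 1.0: every value is unique, like a counter
print(unique_ratio(state))          # 0.6: three distinct values in five rows
```

Note that a column of entirely unique values trivially determines every other column, which is one reason such fields are usually set aside before analysis.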

A collection can typically be thought of as a group of scans run at the same time from a database. However, every individual scan is also assigned a collection ID when viewed through the APIs. So although every scan run has a collection ID, the term "collection" in the UI refers only to groups of multiple scans.

Within EmcienScan, there are several types of outliers:

1) Numeric Outliers: Any number that is 3 or more standard deviations away from the mean

2) String Length Outliers: Any categorical entry whose string length is 6 or more standard deviations from the mean string length

3) Frequency Outliers: Any categorical entry whose frequency is 3 or more standard deviations from the mean frequency for a value
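
The three rules above share one test: is a measurement at least k standard deviations from its mean? A minimal sketch of that idea follows, using only the Python standard library. It assumes the population standard deviation and hypothetical function names and data. The article does not say which standard deviation EmcienScan uses or how it handles edge cases, so treat this as an illustration rather than the product's algorithm.

```python
from collections import Counter
from statistics import mean, pstdev

def far_from_mean(numbers, k):
    """Indices of entries at least k population standard deviations from the mean."""
    mu, sigma = mean(numbers), pstdev(numbers)
    if sigma == 0:  # all values identical, so nothing can be an outlier
        return []
    return [i for i, x in enumerate(numbers) if abs(x - mu) >= k * sigma]

def numeric_outliers(column, k=3):
    """Numbers that are k or more standard deviations from the column mean."""
    return [column[i] for i in far_from_mean(column, k)]

def string_length_outliers(column, k=6):
    """Entries whose length is k or more standard deviations from the mean length."""
    lengths = [len(s) for s in column]
    return [column[i] for i in far_from_mean(lengths, k)]

def frequency_outliers(column, k=3):
    """Distinct values whose count is k or more standard deviations from the mean count."""
    counts = Counter(column)
    values = list(counts)
    freqs = [counts[v] for v in values]
    return [values[i] for i in far_from_mean(freqs, k)]

# Hypothetical sample columns
prices = [10] * 20 + [100]
codes = ["abc"] * 49 + ["x" * 100]
cities = ["Atlanta"] * 50 + [f"city_{i}" for i in range(10)]

print(numeric_outliers(prices))             # [100]
print(len(string_length_outliers(codes)))   # 1 (the 100-character entry)
print(frequency_outliers(cities))           # ['Atlanta']
```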

The Outlierness of a column is a measure of how many outliers are in that column compared to the maximum possible number. The maximum number of outliers that a column can contain is determined using Chebyshev's Inequality, which states that no more than 1/k² of a column can be k or more standard deviations away from the mean. Since EmcienScan defines outliers as being 3 or more standard deviations from the mean, no more than 1/9 ≈ 11% of the column can be outlying values.
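
To make the arithmetic concrete, the sketch below turns an outlier count into a ratio against the Chebyshev maximum. The article does not give the exact Outlierness formula, so the simple ratio used here is an assumption, as are the function names and the example numbers.

```python
def max_outliers(n_rows, k=3):
    """Chebyshev bound: at most n_rows / k**2 values lie k or more std devs from the mean."""
    return n_rows / k**2

def outlierness(n_outliers, n_rows, k=3):
    """Observed outliers as a share of the Chebyshev maximum (assumed definition)."""
    return n_outliers / max_outliers(n_rows, k)

# Hypothetical column: 900 rows, 25 of them flagged as outliers
print(max_outliers(900))     # 100.0, i.e. 1/9 of the rows
print(outlierness(25, 900))  # 0.25
```

Under this reading, a value near 0 means the column has almost no outliers, and a value near 1 means it has nearly as many as is mathematically possible.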

EmcienScan randomly samples across databases and files to ensure accuracy in discovery while delivering results very quickly. For every dataset, EmcienScan will calculate how much it needs to sample, and reports that number in the top right of the home page. If you would like to test whether the autodetect feature will work for your datasets, try running the same dataset multiple times to see if the results change.