This article features a collection of terms and phrases used within EmcienScan.
To see how results from EmcienScan compare to results from other statistical or analysis software, try computing a correlation matrix using only numerical data and note the similarity to your results in EmcienScan. However, unlike most other statistical software, EmcienScan can compute the correlations for non-numeric data as well.
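For readers who want a baseline to compare against, the following is a minimal sketch of a Pearson correlation matrix over numeric-only columns. It does not reproduce how EmcienScan computes its own results; the column names and values are hypothetical, and the helper names `pearson` and `correlation_matrix` are made up for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant column has zero variance and would divide by zero here.
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(columns):
    """Map every (name_a, name_b) pair of columns to its Pearson correlation."""
    names = list(columns)
    return {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

# Hypothetical numeric-only dataset, used purely as an illustration.
data = {
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 90],
    "shoe_size": [43, 38, 41, 40, 39],
}
matrix = correlation_matrix(data)
for (a, b), r in matrix.items():
    print(f"{a:>10} vs {b:<10} {r:+.3f}")
```

Pairs with a coefficient close to +1 or -1 in this matrix are the ones you would expect to show up as strongly related when the same numeric data is scanned.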
Five full green dots indicate that a specific field or many of the fields in the dataset are very connected. The dataset or specific field contains a large amount of information allowing the user to discover relationships between fields and the key drivers for understanding a desired outcome.
Five empty gray dots indicate that a specific field or all of the fields in the dataset are not connected at all. Knowing that data is disconnected is just as valuable as knowing the key drivers within a dataset, and provides a measurement of the quality of data before analysis.
One dot outlined in green indicates that the selected field or the majority of fields within the dataset are slightly predictable.
Most fields or datasets will fall somewhere between the two extremes. This is fairly representative of natural data, which contains natural patterns as well as disparate data points.
Redundant Column Set, RC, is displayed when the value in one field can always be determined by the value of another field in the same set.
PD appears when values in the selected field can always be determined by the values in this field.
UV indicates fields that should not be considered for most analyses, as every value is unique, typically meaning that the field is a serial number or counter.
Mostly Unique Values, MU, indicates that the column consists of mostly unique values and may not be helpful for future analyses.
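The ideas behind these flags can be sketched in a few lines. This is an illustration only, not EmcienScan's implementation: the function names are hypothetical, and the 0.9 cutoff for "mostly unique" is an assumed value, since the article does not state the threshold EmcienScan uses.

```python
def determines(source, target):
    """True if each value in `source` always maps to one value in `target`."""
    seen = {}
    for s, t in zip(source, target):
        # setdefault stores the first target seen for s, then returns it.
        if seen.setdefault(s, t) != t:
            return False
    return True

def unique_ratio(values):
    """Fraction of entries in the column that are distinct."""
    return len(set(values)) / len(values)

def uniqueness_flag(values, mostly_threshold=0.9):
    """Return "UV", "MU", or None. The 0.9 threshold is an assumption."""
    ratio = unique_ratio(values)
    if ratio == 1.0:
        return "UV"
    if ratio >= mostly_threshold:
        return "MU"
    return None

# Hypothetical columns: state always determines country, but not the reverse.
state = ["GA", "GA", "NY", "ON"]
country = ["US", "US", "US", "CA"]
print(determines(state, country))   # state -> country holds
print(determines(country, state))   # country -> state does not
print(uniqueness_flag([101, 102, 103, 104]))
```

When one field determines another in this sense, the two carry overlapping information, which is the situation the redundancy flag describes.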
1) Numeric Outliers: Any number that is 3 or more standard deviations away from the mean
2) String Length Outliers: Any categorical entry whose string length is 6 or more standard deviations from the mean string length
3) Frequency Outliers: Any categorical entry whose frequency is 3 or more standard deviations from the mean frequency for a value
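The three definitions above can be sketched as follows. This is a rough illustration rather than EmcienScan's actual code: it assumes the population standard deviation, measures string lengths over every entry in the column, and measures frequencies over the distinct values. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

def far_from_mean(scores, k):
    """Indices of scores at least k standard deviations from their mean."""
    mu, sigma = mean(scores), pstdev(scores)
    if sigma == 0:
        return []  # all scores identical, so nothing can be an outlier
    return [i for i, s in enumerate(scores) if abs(s - mu) >= k * sigma]

def numeric_outliers(values, k=3):
    """Numbers 3 or more standard deviations from the mean."""
    return [values[i] for i in far_from_mean(values, k)]

def string_length_outliers(values, k=6):
    """Entries whose length is 6 or more standard deviations from the mean length."""
    lengths = [len(v) for v in values]
    return [values[i] for i in far_from_mean(lengths, k)]

def frequency_outliers(values, k=3):
    """Distinct values whose count is 3 or more standard deviations from the mean count."""
    counts = Counter(values)
    distinct = list(counts)
    freqs = [counts[v] for v in distinct]
    return [distinct[i] for i in far_from_mean(freqs, k)]

# Hypothetical columns, each with one obvious outlier.
print(numeric_outliers([10] * 20 + [100]))
print(string_length_outliers(["ab"] * 50 + ["x" * 40]))
print(frequency_outliers([f"v{i}" for i in range(10)] + ["common"] * 100))
```

Note that a single extreme value inflates the standard deviation itself, so small columns may never produce a value that clears the threshold.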
The Outlierness of a column is a measure of how many outliers are in that column compared to the maximum possible number. The maximum number of outliers that a column can contain is determined using Chebyshev's Inequality, which states that no more than 1/k² of a column can be k or more standard deviations away from the mean. Since EmcienScan defines outliers as being 3 or more standard deviations from the mean, no more than 1/9 ≈ 11% of the column can be outlying values.
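As a rough sketch of this ratio, assuming Outlierness is simply the observed outlier count divided by the Chebyshev maximum (the article does not give the exact formula EmcienScan uses, so the function below is hypothetical):

```python
def outlierness(outlier_count, column_size, k=3):
    """Observed outliers as a fraction of the Chebyshev maximum (assumed formula)."""
    # Chebyshev: at most 1/k**2 of the values can lie k or more standard
    # deviations from the mean, so this is the largest possible outlier count.
    max_outliers = column_size / k ** 2
    return outlier_count / max_outliers

# Hypothetical column of 900 values containing 10 outliers.
# With k = 3 the maximum is 900 / 9 = 100 outliers.
print(outlierness(10, 900))
```

Under this assumption, a score near 0 means the column has almost no outliers, and a score near 1 means it has nearly as many as is mathematically possible.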
EmcienScan randomly samples across databases and files to ensure accuracy in discovery while delivering results very quickly. For every dataset, EmcienScan calculates how much it needs to sample and reports that number in the top right of the home page. If you would like to test whether the autodetect feature will work for your datasets, try running the same dataset multiple times to see if the results change.