Data Profiling

Every time a data set is scanned with EmcienScan, the software provides a full profile of the data, allowing the user to gain a deep understanding of what every column contains. This article shows how to profile data using EmcienScan and describes the information the application provides.

To begin, scan your file. The screenshots for this example use public Medicare data found here, which we uploaded into a MySQL database. For information on connecting EmcienScan to databases, see our article on Configuring Data Sources.

Once the scan is completed, clicking on it from the home page will bring you to the data overview page. From here, you can view the distribution of every column alongside its predictive strength and degree of outlierness. Clicking on a data type on the right side of the screen will filter the view down to only the columns that contain that data type. You can sort the columns by clicking on a column header at the top of the table.

To get a more granular view of the data, visit the data profile screen, which shows each column's information alongside sample data. This page contains a complete profile of the data. Each column's name is followed by its data type, overall connected strength, a view of its distribution, the range of values observed within the column, and the outlierness of the column.

Clicking on a column's name will bring you to the data profile section for that column, which gives additional information such as frequency values and the outlier ranges for that column. Scroll down to the data profile at the bottom of the page. If desired, click the arrow in the top right to collapse the predictability section.
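To make "frequency values" and "outlier ranges" concrete, here is a minimal Python sketch of how such figures could be computed for a single numeric column. It is only an illustration: the article does not say how EmcienScan derives its outlier ranges, so the interquartile-range (IQR) fence rule used here is an assumption, and the function names and sample values are hypothetical.

```python
from collections import Counter
import statistics


def frequency_values(values):
    """Return (value, count) pairs, most frequent first."""
    return Counter(values).most_common()


def iqr_outlier_range(values, k=1.5):
    """Return (low, high) fences using the IQR rule.

    Assumption: this is a common textbook rule, not necessarily
    the method EmcienScan uses.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample column, not taken from the Medicare data set.
payments = [120, 135, 135, 140, 150, 150, 150, 900]

top_values = frequency_values(payments)[:3]
low, high = iqr_outlier_range(payments)
outliers = [v for v in payments if v < low or v > high]

print(top_values)
print(low, high, outliers)
```

Values that fall outside the (low, high) range are the ones a profile of this kind would flag as outliers.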

Clicking on 'Stats' displays a list of descriptive statistics about the column.
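For readers unfamiliar with descriptive statistics, the following Python sketch shows the kind of summary such a list typically contains. The exact statistics EmcienScan reports are not listed in this article, so the set chosen here (count, min, max, mean, median, standard deviation) is an assumption, and the `describe` helper and sample values are hypothetical.

```python
import statistics


def describe(values):
    """Return common descriptive statistics for a numeric column."""
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }


# Hypothetical sample column.
stats = describe([2, 4, 4, 4, 5, 5, 7, 9])
for name, value in stats.items():
    print(f"{name}: {value}")
```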

All of this data is available via APIs, allowing you to generate a recurring, up-to-date data profile. To find out more, please see our article on Persistent Data Monitoring.