Welcome to EmcienScan, a data discovery tool that allows a user to explore and compare the contents of any dataset, learn about potential use-cases, and reduce data preparation.
Below is a walkthrough of how to scan data and interpret the results. Sample data files can be found in the sample dataset repository on this Knowledge Base. Hover-text descriptions can be found throughout EmcienScan.
Step 1: Uploading data
The data upload box is located in the upper right corner of the EmcienScan home page. Here the user can upload a data file or access one of their database connections.
By default, the file encoding and delimiter are detected automatically. The user can change these manually by clicking the “optional settings” link in the scan box.
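To give a feel for what automatic detection involves, here is a minimal Python sketch. It is not EmcienScan's implementation, which is not documented here. It assumes a delimiter can be guessed with the standard library's `csv.Sniffer` and an encoding by trying a short list of candidates in order. The function names `guess_delimiter` and `guess_encoding` are hypothetical.

```python
import csv


def guess_delimiter(sample_text, candidates=",;\t|"):
    """Guess the field delimiter of a text sample (illustrative only)."""
    dialect = csv.Sniffer().sniff(sample_text, delimiters=candidates)
    return dialect.delimiter


def guess_encoding(raw_bytes, candidates=("utf-8", "latin-1")):
    """Return the first candidate encoding that decodes the bytes cleanly."""
    for encoding in candidates:
        try:
            raw_bytes.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


sample = "id;name;city\n1;Ana;Lisbon\n2;Ben;Oslo\n3;Cy;Rome\n"
print(guess_delimiter(sample))
print(guess_encoding("café".encode("latin-1")))
```

Guesses like these can be wrong on small or unusual files, which is why a manual override such as the optional settings is useful.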
This button allows the user to preview the data and select which columns can be removed before scanning.
The user can customize individual scan file names or database table names. When a collection of database tables or views is selected, the user can choose to add a prefix to each. As with a CSV upload, the optional settings include a section for notes about the scan. The user can also enter a sample size, that is, how many rows to sample from the file, table, or view being scanned. By default, EmcienScan automatically detects how many rows it needs to sample. To find out more about the autodetect feature, check out our glossary.
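As an illustration of what pulling a fixed-size sample means, the sketch below draws a uniform random sample of rows in a single pass using reservoir sampling. This is a generic technique chosen for the example, not a description of how EmcienScan samples. The function name `sample_rows` and the `seed` parameter are assumptions.

```python
import random


def sample_rows(rows, sample_size, seed=0):
    """Uniformly sample up to sample_size rows from an iterable in one pass."""
    rng = random.Random(seed)
    reservoir = []
    for index, row in enumerate(rows):
        if index < sample_size:
            # Fill the reservoir with the first sample_size rows.
            reservoir.append(row)
        else:
            # Replace an existing entry with decreasing probability.
            slot = rng.randint(0, index)
            if slot < sample_size:
                reservoir[slot] = row
    return reservoir


picked = sample_rows(range(1000), 10)
print(len(picked))  # 10
```

If the source has fewer rows than the requested sample size, every row is returned.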
This button allows the user to preview the data and select which columns to remove before scanning.
Step 2: View results
The analyzed data is now visible on the EmcienScan home page. EmcienScan displays the connected strength of each scan through a system of five dots. The amount of green filling in the dots shows the overall connected strength of the dataset. The dots are also a measure of the quality of the dataset: high-quality data will contain many predictive relationships, while a low-quality, disparate dataset will contain no natural patterns. Click here to learn more about the five-dot system.
The home page also shows the relative outlierness of a dataset. The outlierness of a dataset is determined by measuring the number of columns that contain outliers. Click here to learn more about how EmcienScan finds outliers.
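The idea of measuring outlierness by counting columns that contain outliers can be sketched in Python as follows. How EmcienScan actually detects outliers is not stated here, so the sketch assumes the common 1.5 × IQR rule for numeric columns. The function names and sample table are hypothetical.

```python
from statistics import quantiles


def has_outliers(values, k=1.5):
    """Flag a numeric column if any value falls outside the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return any(v < low or v > high for v in values)


def columns_with_outliers(table):
    """Return the names of the columns that contain at least one outlier."""
    return [name for name, values in table.items() if has_outliers(values)]


table = {
    "age": [31, 29, 35, 33, 30, 32, 34],
    "income": [40, 42, 41, 39, 43, 40, 400],
}
flagged = columns_with_outliers(table)
print(flagged)                    # ['income']
print(len(flagged) / len(table))  # 0.5
```

Here one of the two columns contains an outlier, so half of the columns count toward the dataset's outlierness under this assumed rule.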
Also displayed on the home page are scan collections, which are groups of scans run at the same time. You can run multiple scans at the same time when connected to a database by holding the control or shift key and selecting multiple tables in the database.
Clicking into a scan will take you to the Data Overview page. The top of the Data Overview page displays the dataset's overall connected strength, as well as what type of data is in the dataset and the outlierness of the data. As you scroll down the page, you can see the connected strength of every column, how many outliers the column contains, and the column's data type. On the right, you can filter the page down to specific data types or to certain types of outliers. Highly connected fields indicate that data in the selected field is influenced by data in other fields. When scrolling through the list of fields in the left panel, the user can easily see which fields are connected and which are not.
Step 3: Focusing on a specific outcome field
By clicking on a field name in the left panel, the user can explore more detailed information on what this field is correlated with. The Column Details page lists all the other fields in the dataset that directly correlate with the selected outcome field, ordered by their respective connected strengths. Click here to learn more about connected strength.
At the top of the list are the most connected fields, or the most important factors to consider when trying to model or predict this outcome field.
Notice that some fields in the list are flagged as having relationships with other fields:
- Redundant Column Set (RC): the value in one field can always be determined by the value of another field in the same set.
- PD: values in the selected field can always be determined by the values in this field.
- Unique Value (UV): every value in the field is unique, typically meaning that the field is a serial number or counter. These fields should not be considered for most analyses.
Further down the list are the very weakly correlated or disconnected fields. These fields can also be excluded from data preparation or analysis, as they are not relevant to the selected field.
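The redundant column set and unique value flags described above can be illustrated with a small Python sketch. It is only an approximation of the definitions given in the text, not EmcienScan's own logic. The helper names `is_unique_value_column` and `determines`, and the sample rows, are hypothetical.

```python
def is_unique_value_column(values):
    """True when no value repeats, as with a serial number or counter."""
    return len(set(values)) == len(values)


def determines(source, target):
    """True when each value in source always pairs with one value in target."""
    seen = {}
    for s, t in zip(source, target):
        # setdefault records the first pairing and returns it on later visits.
        if seen.setdefault(s, t) != t:
            return False
    return True


rows = {
    "order_id": [1001, 1002, 1003, 1004],
    "state_code": ["GA", "GA", "NY", "CA"],
    "state_name": ["Georgia", "Georgia", "New York", "California"],
    "channel": ["web", "store", "web", "web"],
}

print(is_unique_value_column(rows["order_id"]))            # True
print(determines(rows["state_code"], rows["state_name"]))  # True
print(determines(rows["channel"], rows["state_code"]))     # False
```

In this sample, `state_code` and `state_name` determine each other, so they behave like a redundant set. A unique-value column such as `order_id` trivially determines every other column, which is one reason such fields carry little analytical value.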
The Column Details page can be used to:
- Discover new use cases within data by finding the key predictors and connectors within a dataset
- Optimize data by reducing the amount of data that needs to be moved, stored or cleaned, through identifying unnecessary and low-value parts of the dataset
- Aid predictive modeling by immediately revealing the key drivers of a field as well as the weakly correlated fields that should not be carried through to analysis
Scrolling down the page, there is a full profile of the data, complete with the data distribution and the types and amounts of outliers found in the data set.
The benefit of this page is that it shows what is in the data, taking the guesswork out of deciding which fields to include when analyzing the selected outcome field. EmcienScan identifies the fields that are not correlated with the selected outcome field, so the analyst can keep the fields with stronger predictive relevance and exclude many of the rest.
This reduces the number of fields in a model and the amount of information the user has to store and cleanse for analysis, while giving the user an idea of what is in the dataset and which key fields to use when trying to predict a field of interest.
Step 4: Getting to know a dataset
Column Groups are sets of fields that are correlated with each other. On the Column Groups page, the user can see the sets of correlated fields found in the dataset.
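One way to picture how sets of correlated fields could be formed is to link any two columns whose pairwise strength passes a threshold and then collect the linked columns into groups. The Python sketch below does exactly that. It is an assumption made for illustration, not EmcienScan's grouping method, and the strength scores, threshold, and function name `column_groups` are hypothetical.

```python
def column_groups(columns, strengths, threshold=0.5):
    """Group columns linked, directly or transitively, by strength >= threshold."""
    neighbors = {column: set() for column in columns}
    for (left, right), strength in strengths.items():
        if strength >= threshold:
            neighbors[left].add(right)
            neighbors[right].add(left)

    groups, seen = [], set()
    for start in columns:
        if start in seen:
            continue
        # Depth-first walk collects every column reachable from start.
        stack, group = [start], []
        seen.add(start)
        while stack:
            column = stack.pop()
            group.append(column)
            for other in neighbors[column]:
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
        groups.append(sorted(group))
    return groups


columns = ["price", "tax", "total", "color", "size"]
strengths = {
    ("price", "tax"): 0.9,
    ("tax", "total"): 0.8,
    ("color", "size"): 0.2,
    ("price", "color"): 0.1,
}
print(column_groups(columns, strengths))
# [['price', 'tax', 'total'], ['color'], ['size']]
```

Columns that end up alone in a group are the disconnected fields that would be candidates for exclusion.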
The insight discovered from Column Groups can be used for:
- Data Optimization: The reduction of the amount of data needed to be moved, stored or cleaned by identifying unnecessary and low value parts of your data and segmenting out your columns of interest
- Data Segmentation: The column groupings provide a segmentation that allows for optimal storage of “overloaded” tables within Hadoop or MySQL
- Business Intelligence Reports: The natural groupings within the data can be viewed within a BI report for quick insights
EmcienScan has now given the user a comprehensive view of what is in the dataset, including the relationships between numeric and categorical data, which are difficult to see with other data analysis tools. EmcienScan has quickly reviewed the data without any data preparation and separated the valuable data points from the noise for quicker and easier analysis.