Understanding how your data connects lets you generate use cases from previously dark or unknown data (to learn about "Dark Data", click here). Below is a walkthrough of how to scan data to generate use cases and understand how the data connects.
To begin, upload your file. The sample data file used in this example can be found as the HR Dataset in the sample dataset repository on this Knowledge Base.
The scanned data is now visible on the EmcienScan home page. EmcienScan displays the connected strength of each scan through a system of five dots. The more green filling the dots contain, the greater the overall connected strength of the dataset. These dots are also a measure of the quality of the dataset: high-quality data will contain many connected relationships, while low-quality, disparate datasets will contain no natural patterns. Click here to learn more about the five dot system.
Now that the scan is complete, click on the link to be taken to the data summary view. EmcienScan has combed through the file to find the connections for each field, as well as how connected the dataset is as a whole. The data overview lists every column in the dataset along with its measure of connected strength, the types of data in the column, the distribution of values within the column, and its degree of outlierness. The more connected a field is, the more natural correlations and connections it shares with other columns in the dataset. Highly connected columns will typically be easier to analyze, while analysis of columns with little or no connected strength may not be as fruitful.
For this example, imagine being tasked with understanding what drives employee attrition at this company. Clicking on 'Attrition' takes us to the column detail page for that field. At the top of the page is an ordered list showing how every field in the dataset connects with 'Attrition'. The fields at the top of the list are the most predictive, that is, the most important factors to consider when trying to model or predict this outcome field.
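The ordered list described above can be approximated outside the tool. The sketch below is not EmcienScan's method, which this article does not describe. It assumes mutual information as a stand-in measure of how strongly each field connects with the outcome. The function names `mutual_information` and `rank_fields`, and the tiny sample rows, are made up for illustration and are not taken from the HR Dataset.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equal-length lists of categorical values."""
    n = len(xs)
    px = Counter(xs)            # counts of each value in the first field
    py = Counter(ys)            # counts of each value in the second field
    pxy = Counter(zip(xs, ys))  # counts of each co-occurring pair
    mi = 0.0
    for (x, y), c in pxy.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written in terms of counts
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def rank_fields(rows, target):
    """Return (field, score) pairs, most to least associated with the target field."""
    target_values = [row[target] for row in rows]
    scores = {}
    for field in rows[0]:
        if field == target:
            continue
        values = [row[field] for row in rows]
        scores[field] = mutual_information(values, target_values)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up sample rows for illustration only.
rows = [
    {"OverTime": "Yes", "Gender": "F", "Attrition": "Yes"},
    {"OverTime": "Yes", "Gender": "M", "Attrition": "Yes"},
    {"OverTime": "No", "Gender": "F", "Attrition": "No"},
    {"OverTime": "No", "Gender": "M", "Attrition": "No"},
]
ranking = rank_fields(rows, "Attrition")
for field, score in ranking:
    print(f"{field}: {score:.2f} bits")
```

In this toy sample, the field that always matches the outcome lands at the top of the ranking and the field that is independent of it lands at the bottom, which mirrors how the column detail page orders fields from most to least predictive.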
Scrolling down from the top of the page, EmcienScan has also identified the columns that are only very weakly correlated with the outcome of attrition and may add noise to any further analysis. This is valuable because it saves you the time of storing, transferring, and cleaning those columns for any analysis you carry forward.
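Once each field has a strength score against the outcome, setting aside the noisy columns is a simple filter. The sketch below assumes the scores are already available as a dictionary. The field names, the score values, the 0.05 cutoff, and the helper `split_by_strength` are all hypothetical and do not come from EmcienScan.

```python
def split_by_strength(scores, threshold):
    """Split fields into (keep, drop) lists by comparing each score to a cutoff."""
    keep = [field for field, score in scores.items() if score >= threshold]
    drop = [field for field, score in scores.items() if score < threshold]
    return keep, drop


# Hypothetical strength scores against the outcome field.
scores = {"OverTime": 0.42, "JobLevel": 0.18, "EmployeeNumber": 0.01}

# The cutoff is an arbitrary illustrative value, not a recommended setting.
keep, drop = split_by_strength(scores, threshold=0.05)
print("keep:", keep)
print("drop:", drop)
```

Columns in the `drop` list are candidates to leave out before storing, transferring, or cleaning the data for further analysis.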
EmcienScan has given us all of the information we need to start a successful analysis of employee attrition at this company. But imagine that your manager comes to you and offers you the opportunity to carry out an analysis on employee education instead. How would you choose what to analyze?
EmcienScan makes this task easy by allowing you to rank your use cases by predictive strength. To start, compare the two columns on the data overview page:
Notice that Attrition has two connected dots, while Education has only one. Clicking on 'Education' shows how that variable correlates with all of the other columns in the dataset:
Education has only a few non-noisy correlating variables, and their connected strength is very small. This means that, given the choice between analyzing employee attrition or employee education for root cause or key drivers, attrition will be much easier to analyze. It also tells you that to analyze an employee's education level, you will need to gather more data, or possibly clean or augment the dataset to make it more connected.
Getting to know a dataset
Column Groups are sets of fields that correlate with each other. On the Column Groups page the user can see the sets of correlated fields produced from the dataset, along with their respective predictive group strengths. The predictive group strength represents the overall correlative strength of each group of related fields. The ungrouped fields are those that do not exhibit a significant relationship with any other fields in the dataset.
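One way to picture column grouping is as connected components in a graph whose edges join strongly related fields. The sketch below illustrates that idea only and is not EmcienScan's grouping algorithm. It assumes pairwise strengths are already known, links any pair at or above a cutoff, and reports the mean pairwise strength inside each group as a stand-in for the predictive group strength. The column names, strength values, cutoff, and the function `column_groups` are hypothetical.

```python
from itertools import combinations


def column_groups(columns, pair_strength, threshold):
    """Group columns linked by pairs whose strength meets the threshold.

    Returns (groups, ungrouped). Each group is (list_of_columns, mean_pair_strength),
    and groups are sorted from strongest to weakest.
    """
    parent = {c: c for c in columns}

    def find(c):
        # Follow parent links to the group representative, compressing the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    # Merge the groups of every sufficiently strong pair.
    for (a, b), strength in pair_strength.items():
        if strength >= threshold:
            parent[find(a)] = find(b)

    members = {}
    for c in columns:
        members.setdefault(find(c), []).append(c)

    groups, ungrouped = [], []
    for cols in members.values():
        if len(cols) == 1:
            ungrouped.append(cols[0])
            continue
        # Pairs missing from the input are treated as having zero strength.
        strengths = [
            pair_strength.get((a, b), pair_strength.get((b, a), 0.0))
            for a, b in combinations(cols, 2)
        ]
        groups.append((cols, sum(strengths) / len(strengths)))

    groups.sort(key=lambda g: g[1], reverse=True)
    return groups, ungrouped


# Hypothetical columns and pairwise strengths for illustration only.
columns = ["JobLevel", "MonthlyIncome", "TotalWorkingYears",
           "Department", "JobRole", "EmployeeNumber"]
pair_strength = {
    ("JobLevel", "MonthlyIncome"): 0.9,
    ("JobLevel", "TotalWorkingYears"): 0.6,
    ("MonthlyIncome", "TotalWorkingYears"): 0.6,
    ("Department", "JobRole"): 0.8,
    ("JobLevel", "Department"): 0.1,
}
groups, ungrouped = column_groups(columns, pair_strength, threshold=0.5)
for cols, strength in groups:
    print(f"{strength:.2f}", cols)
print("ungrouped:", ungrouped)
```

A field that never reaches the cutoff with any other field ends up in `ungrouped`, which corresponds to the ungrouped fields described above.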
The insight discovered from Column Groups can be used for:
- Data Optimization: Reduce the amount of data that needs to be moved, stored, or cleaned by identifying unnecessary and low-value parts of your data and segmenting out your columns of interest
- Data Segmentation: The column groupings provide a segmentation that allows for optimal storage of “overloaded” tables within Hadoop or SQL
- Business Intelligence Reports: The natural groupings within the data can be viewed within a BI report for quick insights
EmcienScan has now given the user a comprehensive view of what is in the dataset, including the relationships between numeric and categorical data in the dataset, which is difficult with other data analysis tools. EmcienScan has quickly reviewed the data without any data preparation and separated the valuable data points from the noise for quicker and easier analysis.
For information on how to continually check the connected strength of every variable within your data set to monitor the data set's structure, see our article on Persistent Data Monitoring.