Persistent Data Monitoring

While accomplishing use cases through the EmcienScan user interface can save a large amount of time and effort over traditional methods of data discovery, the true value of EmcienScan lies in automating those use cases through the RESTful APIs and running scans continuously. This provides constant, up-to-date knowledge of your data and its qualities. Although you may use a variety of techniques to automate these steps, below are the steps to automate these use cases through cURL commands executed on the command line.

Although the examples below list directions for collections of tables, running on individual tables within a database is easy: just change the API endpoints from "scan_collections" to "scans". For the exact API destinations, see the "API Reference" section within EmcienScan.
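
For example, the collection rebuild shown in step 2 below becomes the following for a single table (a sketch, assuming the "scans" endpoints mirror the "scan_collections" endpoints as described above; confirm the exact paths in the API Reference):

curl -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scans/<Scan ID>/rebuild?auth_token=<Your Auth Token>'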

To continuously check a table or database for the presence of numerical and categorical outliers, the steps are as follows:

1) Start a scan. This can be done once via the user interface, or through the APIs using the following cURL command:

curl -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d '{ "params": { "data_source_name": "<Database Name>", "tables": ["<First Table Name>", "<Second Table Name>","...", "<Final Table Name>" ], "name": "<Your Collection Name>", "notes": "", "rows": "auto", "exclude_blanks": "true" }, "type": "db" }' 'https://<Your URL or Hostname>/api/v1/scan_collections?auth_token=<Your_Auth_Token>'
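
If you are scripting this step, capture the new Collection ID for the commands that follow. A minimal sketch using jq, assuming the JSON body above has been saved to a file named collection.json and the response returns the new collection's ID in a top-level "id" field (check the actual response shape against the API Reference):

COLLECTION_ID=$(curl -s -X POST --header 'Content-Type: application/json' --header 'Accept: application/json' -d @collection.json 'https://<Your URL or Hostname>/api/v1/scan_collections?auth_token=<Your_Auth_Token>' | jq -r '.id')
echo "Created collection ${COLLECTION_ID}"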

2) Once the scan collection has been created, you can rebuild it persistently at a fixed cadence by automating the following command:

curl -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scan_collections/<Collection ID>/rebuild?auth_token=<Your Auth Token>'
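
Any scheduler works for the fixed cadence. As one sketch, a crontab entry that triggers the rebuild every night at 2:00 AM (the log path is illustrative):

0 2 * * * curl -s -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scan_collections/<Collection ID>/rebuild?auth_token=<Your Auth Token>' >> /var/log/emcienscan_rebuild.log 2>&1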

3) To check the status of a Collection while it is being rebuilt, try:

curl -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scan_collections/<Collection ID>/status?auth_token=<Your Auth Token>'
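
Per the "state" definition at the bottom of this page, a collection passes through states such as initialized, scanning, ready, and failed. A polling sketch, assuming the status response exposes that field as a top-level "state" key in the JSON (adjust the jq path to match your deployment):

# Poll once a minute until the rebuild settles
while true; do
  STATE=$(curl -s --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scan_collections/<Collection ID>/status?auth_token=<Your Auth Token>' | jq -r '.state')
  echo "Collection state: ${STATE}"
  [ "${STATE}" = "ready" ] && break      # rebuild complete
  [ "${STATE}" = "failed" ] && exit 1    # surface the failure to the scheduler
  sleep 60
done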

4) To retrieve the data for the scans within your collection, run:

curl -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scan_collections/<Collection ID>?include=scans&auth_token=<Your Auth Token>'
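
To act on this output programmatically, here is a sketch that extracts each scan's "overall_scan_outlierness" (explained in the note and the definitions below), assuming the response exposes the included scans as a top-level array and each scan carries its metrics under a "data_data" key (verify the exact nesting in your actual response):

curl -s --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scan_collections/<Collection ID>?include=scans&auth_token=<Your Auth Token>' | jq -r '.scans[] | "\(.name): \(.data_data.overall_scan_outlierness)"'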

When building a profile of your data, the relevant data points will be within the "data_data" structure, most notably "overall_scan_outlierness". For an exhaustive guide to what each field means, see the definitions at the bottom of this page. However, if you would like to retrieve the columns for a specific scan within a collection, run:

curl -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scans/<Your Scan ID>/columns?auth_token=<Your Auth Token>'

The relevant information for understanding the data and generating use cases lies within the individual column JSON files. To retrieve it, you must include column relations in the API call: in the UI, enter "column_relations" in the "include" parameter. The corresponding API endpoint and cURL request would be:

curl -X GET --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scans/<Your Scan ID>/columns?include=column_relations&auth_token=<Your Auth Token>'
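
With the column JSON in hand, the per-column fields defined below (for example "name" and "outlierness") can drive automated alerting. A sketch that flags columns whose total outlierness exceeds an illustrative threshold of 0.1, assuming the response is a JSON array of column objects (adjust the jq path if your response nests them differently):

curl -s --header 'Accept: application/json' 'https://<Your URL or Hostname>/api/v1/scans/<Your Scan ID>/columns?include=column_relations&auth_token=<Your Auth Token>' | jq -r '.[] | select(.outlierness > 0.1) | "ALERT \(.name): outlierness \(.outlierness)"'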

Below are the definitions for terms found within the API responses.

Object name Description
"id" The ID for this column
"name" The name of this column
"type" The data type for this column
"Primary_type" The primary type of data within this column
"Unique_number_of_values" How many unique values are in this column
"min" The smallest number in this column
"lower_quartile" The number for the cutoff of the lower quartile The number signifying the boundary between the lower quarter of the data and the upper 75%
"median" The median of the numbers in this column
"upper_quartile" The number for the cutoff of the upper quartile The number signifying the boundary between the upper quartile of the data and the lower 75%
"max" The biggest number in this column
"mean" The mean of the numbers in this column
"standard_deviation" The standard deviation of the numbers in this column
"skewness" The skewness of the numerical distribution in this column  Knowing the skewness of a column can give you information to understand if the distribution is tailed in one direction or another.
"kurtosis" The excess kurtosis of the numerical distribution in this column If the Kurtosis of a column is negative, the distribution is flatter than a normal distribution. If it is positive, it will be sharper.
"group_id" The id for the column group that this column is in
"group_size" How many columns are in this column’s group
"column_connected_strength" the connected strength for this column
"redundant_column_id" The ID for the redundant column set, if it exists
"redundant_column_count" The number of redundant columns for the selected column
"frequency_histogram_data" The frequency distribution data for this column. Each bucket or group has the width of one half of one standard deviation.
"Most_frequent_categorical_data" The most frequently occurring categorical values in this column
“Least_frequent_categorical_data" The least frequently occurring categorical values in this column
"numeric_most_frequent_data" The most frequently occurring numerical values in this column
"numeric_least_frequent_data" The least frequently occurring categorical values in this column
"numeric_high_outlier_data" The high numeric outliers for this column 
"numeric_low_outlier_data" The low numeric outliers for this column
"frequency_outlier_data" The categorical frequency outliers for this column
"string_length_short_outlier_data" Small string length outliers in this column
"string_length_long_outlier_data" Long string length outliers in this column
"numeric_high_outlier_count" The number of high numeric outliers in this column
"numeric_low_outlier_count" The number of low numeric outliers in this column
"Frequency_outlier_count" The number of categorical frequency outliers are in this column
"string_length_short_outlier_count" The number of short string length outliers in this column 
"string_length_long_outlier_count" The number of long string length outliers in this column
"Categorical_row_count" The number of categorical values in this column
"integer_row_count" The number of categorical values in this column
"decimal_row_count" The number of decimal numeric values in this column
"empty_row_count" The number of empty values in this column
"numeric_outlierness" The outlierness of this column (link to glossary) for numeric values
"frequency_outlierness" The outlierness of this column (link to glossary) for categorical values
"string_length_outlierness" The amount of string length outliers in this column compared to the maximum possible amount using Chebyshev's inequality
"categorical_outlierness" The amount of categorical outliers in this column compared to the maximum possible amount using Chebyshev's Inequality Chebyshev's inequality states that no more than 1/9 of a column can exist more than 3 standard deviations away from the mean, which we use for outlierness
"outlierness" The total outlierness of this column
"numeric_histogram_data" The numeric distribution of data, where every bucket is half the standard deviation of these numbers
"numeric_outliers" Detailed information for the numeric outliers in this column
"lower_bound" The smaller cutoff for what constitutes an outlier
"upper_bound" The upper cutoff for what constitutes an outlier
"lower_count" The number of low outliers in this column
"Upper_count" The number of high outliers in this column
"frequency_outliers" Detailed information for the categorical frequency outliers in this column
"string_length_outliers" Detailed information for the categorical string length outliers in this column
"categorical_pareto_data" The distribution information for the categorical values in this column
"categorical_pareto_records" The total number of records in the categorical pareto data
"categorical_pareto_distincts" The distinct number of records in the categorical pareto data
"frequency_standard_deviation"  the standard deviation of the categorical frequency records
"string_length_standard_deviation" Standard deviation of the string length for categorical values in this data set
"column_connected_strength_dots" The number of connected strength dots for this column
"frequency_mean" The mean for the frequency of categorical values in this column
"string_length_mean" The mean for the string length of categorical values in this column
"scan_members" The individual tables or objects scanned in this collection
"user_id" The id for the user running the scan
"state" The current state of the scan collection (e.g. initialized, scanning, ready, failed)
"count" How many tables were scanned in this collection
"fail_count" The number of tables that failed to scan
"complete_count" The number of tables that were successfully scanned
"finished_at" The time that the scan collection finished
"creation_date" The timestamp when the scan was created
"last_updated" The time at which the collection was last updated. If finished, the time at which the collection finished
"name" The name of the scan collection
"error" The reason a scan did not complete, if a scan did not complete due to error
"source_row_count" The number of rows in the source table
"sampled_row_count" The number of rows that were sampled from the source table
"total_cell_count" The total number of cells in the sample
"Empty_cell_count" The total number of empty cells that were sampled
"decimal_cell_count" The number of cells sampled with decimal numbers
"integer_cell_count" The number of cells sampled with integer numbers
"numeric_cell_count" The total number of numeric cells sampled
"categorical_cell_count" The number of cells sampled with categorical values
"column_count" The number of columns in the table
"column_group_count" the number of column grouping in the table
"overall_connected_strength ": The overall connected strength for the table
"overall_connected_strength_dots" The number of connected strength dots in the table
"overall_scan_outlierness" The overall outlierness of the table, or the fraction of columns that contained outliers