This page has been machine-translated and may contain inaccuracies in phrasing or product terminology. If discrepancies exist, the original Japanese version takes precedence.
This document explains the statistical information displayed in the detailed metadata of assets and how it is calculated. The calculation logic varies depending on the data source service.
Intended Audience for This Explanation
What Is Statistical Information
Statistical information is part of the detailed metadata of DB data assets. It consists of calculated values that show what kind of data is stored in a table or view, how much of it there is, and how it is distributed. In QDIC, statistical values are obtained from the user organization's data sources and reflected in the catalog.
- The target assets are tables and views.
- Statistical values are calculated for each column.
- There may be limitations on the data sources and assets for which statistical information is displayed. Please contact your administrator for details.
Statistical Values Obtained
The full set of statistical values is calculated only for columns with a numeric data type. For string and date columns, only some of the statistical values are calculated. Statistical values are not calculated for columns of other data types.
Statistical Values Calculated Only for Numeric Column Data Types
- Maximum value
- Minimum value
- Average value
- Median
- Mode
- Standard deviation
Statistical Values Calculated for Numeric, String, and Date Column Data Types
- Number of unique values
- Number of NULLs
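The split between the two lists above can be illustrated with a minimal Python sketch. It is not how QDIC or any connector computes these values; `column_statistics` is a hypothetical helper, and the choices to exclude NULLs from the unique count and to use the population standard deviation are assumptions made only for illustration.

```python
import statistics
from datetime import date
from numbers import Number


def _is_numeric(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, Number) and not isinstance(value, bool)


def column_statistics(values):
    """Return illustrative statistics for one column (hypothetical helper)."""
    non_null = [v for v in values if v is not None]

    numeric = all(_is_numeric(v) for v in non_null)
    string = all(isinstance(v, str) for v in non_null)
    dated = all(isinstance(v, date) for v in non_null)
    if not (numeric or string or dated):
        return {}  # other data types: no statistical values

    # Calculated for numeric, string, and date columns.
    stats = {
        "unique_count": len(set(non_null)),  # assumption: NULLs not counted
        "null_count": len(values) - len(non_null),
    }

    # Calculated only for numeric columns.
    if numeric and non_null:
        stats.update(
            {
                "max": max(non_null),
                "min": min(non_null),
                "mean": statistics.mean(non_null),
                "median": statistics.median(non_null),
                "mode": statistics.mode(non_null),
                "stddev": statistics.pstdev(non_null),  # assumption: population
            }
        )
    return stats


prices = [10, 20, 20, None, 50]
names = ["a", "b", "a", None]

stats = column_statistics(prices)
name_stats = column_statistics(names)
print(stats)
print(name_stats)
```

In this sketch the numeric column yields all eight values, while the string column yields only the number of unique values and the number of NULLs.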
Mechanism of Statistical Value Calculation
The mechanism for calculating statistical values varies depending on the data source and the amount of target data. For some data sources, administrators may limit the amount of data included in the calculation or the statistical values derived, to reduce the operational cost on the data source engine.
Amazon Redshift
Statistical values are calculated from all data in the data source. To reduce the operational cost of the data source, the set of statistical values calculated may be limited.
Databricks
Statistical values are obtained using the data source mechanism (Lakehouse Monitoring).
Google BigQuery
Statistical values are obtained using the data source mechanism (Dataplex).
Snowflake
Statistical values are calculated from sampled data, so their accuracy depends on the sample size. When the calculation does not cover all data, the values displayed in the catalog should be treated as reference values only. Administrators and environment operators determine the sample size, taking into account the connector execution time and the load on Snowflake in order to manage operational costs.
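To see why sampled statistics are reference values, the following Python sketch compares a mean computed over all rows with one computed over a random sample. The data is synthetic, the sample size of 1,500 is only an example, and the sketch does not reproduce how the connector or Snowflake actually samples rows.

```python
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same numbers

# Synthetic "column": 100,000 integer values (assumption for illustration).
population = [random.randint(0, 999) for _ in range(100_000)]

# Draw 1,500 rows without replacement, as a stand-in for sampling.
sample = random.sample(population, 1_500)

full_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)

print(f"mean over all rows:    {full_mean:.1f}")
print(f"mean over sampled rows: {sample_mean:.1f}")
print(f"difference:             {abs(full_mean - sample_mean):.1f}")
```

The two means are close but generally not identical, and the gap tends to shrink as the sample size grows.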
Relationship Between Sample Size and Statistical Values
Administrators often decide the sample size based on the following guidelines:
- Assuming a binomial distribution with a confidence level of 95% and a population proportion of 0.5, a margin of error of ±2.5% (a confidence interval 5 percentage points wide) requires a sample size of about 1,500.
- Under the same assumptions, a margin of error of ±5% (a confidence interval 10 percentage points wide) requires a sample size of about 400.
* In general surveys, a sample size of about 400 is often used as a guideline, but if a narrower margin of error is desired, a sample size of about 1,500 is the guideline.
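The guideline figures can be checked with the standard normal-approximation formula for estimating a proportion, n = z² · p(1 − p) / e², where e is the half-width of the confidence interval. Under this formula, sample sizes of about 1,500 and about 400 correspond to half-widths of 2.5 and 5 percentage points (interval widths of 5% and 10%). The sketch below is only a check of that arithmetic; `required_sample_size` is a hypothetical helper, not part of QDIC.

```python
import math
from statistics import NormalDist


def required_sample_size(half_width, confidence=0.95, proportion=0.5):
    """Sample size for estimating a proportion (normal approximation)."""
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z**2 * proportion * (1 - proportion) / half_width**2
    return math.ceil(n)  # round up to a whole number of rows


n_narrow = required_sample_size(0.025)  # interval 5 percentage points wide
n_wide = required_sample_size(0.05)     # interval 10 percentage points wide
print(n_narrow, n_wide)  # prints "1537 385"
```

Halving the half-width roughly quadruples the required sample size, which is why the narrower guideline is about four times the wider one.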