Aggregation Recipe

The Aggregation recipe summarizes a data set by grouping its rows and computing counts, statistics, or other aggregates for each group. It is particularly useful for the large-scale processing and analytics tasks found in big data environments.

Configuration

| Configuration | Description |
| --- | --- |
| Recipe Name | A freeform name for the recipe |
| Input | Select a previously constructed recipe to process |
| Grouping Columns | Choose the columns from the input data to group rows by |
| Aggregations | Select the columns to aggregate and the aggregation method to apply to each |
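
To illustrate how the Grouping Columns and Aggregations settings relate, the following is a minimal sketch in plain Python, not the recipe's actual implementation. The sample rows, the column names (`region`, `amount`), and the `aggregate` helper are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical input rows; the column names are illustrative only.
rows = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 5},
    {"region": "west", "product": "a", "amount": 7},
    {"region": "east", "product": "a", "amount": 3},
]


def aggregate(rows, grouping_columns, aggregations):
    """Group rows by grouping_columns, then apply each (column, name, function)."""
    groups = defaultdict(list)
    for row in rows:
        # Rows with identical values in the grouping columns share a group.
        key = tuple(row[column] for column in grouping_columns)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(grouping_columns, key))
        for column, name, func in aggregations:
            values = [member[column] for member in members]
            out[f"{name}_{column}"] = func(values)
        result.append(out)
    return result


summary = aggregate(
    rows,
    grouping_columns=["region"],
    aggregations=[("amount", "sum", sum), ("amount", "count", len), ("amount", "max", max)],
)
for line in summary:
    print(line)
```

Each output row holds one combination of grouping values plus one value per requested aggregation, which is the general shape to expect from a group-and-aggregate step.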

Aggregation Methods

| Method | Shorthand | Description |
| --- | --- | --- |
| Count | COUNT | Returns the count of values for each column in each group |
| Sum | SUM | Returns the total sum of values for each column in each group |
| Average | AVG | Returns the average of the values for each column in each group |
| Min | MIN | Returns the minimum value for each column in each group |
| Max | MAX | Returns the maximum value for each column in each group |
| First | FIRST | Returns the first value in each group |
| Last | LAST | Returns the last value in each group |
| Sum distinct | SUM DISTINCT | Returns the sum of the distinct values in each group |
| Avg distinct | AVERAGE DISTINCT | Returns the average of the distinct values in each group |
| Collect list | COLLECT_LIST | Returns all values in a group as a list, including duplicates |
| Collect set | COLLECT_SET | Returns the distinct set of values in a group |
| Standard deviation | STDDV | Alias for Standard deviation sample |
| Standard deviation sample | STDDV_SMP | Returns the unbiased sample standard deviation of the values in a group |
| Variance | VARIANCE | Alias for var_samp (Variance Sample) |
| Variance Sample | VARIENCE_SAMP | Returns the unbiased sample variance of the values in a group |
| Skewness | SKEWNESS | Returns the skewness of the values in a group, an indication of whether the data points are spread out more on one side of the mean than the other |
| Kurtosis | KURTOSIS | Returns the kurtosis of the values in a group, a statistical measure of the "tailedness" of the distribution: how heavy or light its tails are compared to a normal distribution |
| Percentile Approx | PERCENTILE_APPROX | Returns the approximate percentile of a numeric column, where the percentile is given as a value between 0.0 and 1.0. The approximation trades accuracy for performance. For exact percentiles, use the percentile aggregator |
| Correlation | CORR | Calculates the correlation between two columns and returns the result as a double. Currently, only the Pearson correlation coefficient is supported |
| Covariance Population | COVAR_POP | Computes the population covariance between two columns. Covariance is a measure of how much two random variables vary together |
| Bit and | BIT_AND | Bitwise AND is a binary operation that takes two binary numbers as operands and performs the AND operation on each pair of corresponding bits. The result is a new binary number where each bit is set to 1 if both of the corresponding bits of the operands are 1, and 0 otherwise |
| Bit or | BIT_OR | Bitwise OR is a binary operation that takes two binary numbers as operands and performs the OR operation on each pair of corresponding bits. The result is a new binary number where each bit is set to 1 if at least one of the corresponding bits of the operands is 1, and 0 otherwise |
| Bit xor | BIT_XOR | Bitwise XOR (exclusive OR) is a binary operation that takes two binary numbers as operands and performs the XOR operation on each pair of corresponding bits. The result is a new binary number where each bit is set to 1 if exactly one of the corresponding bits of the operands is 1, and 0 otherwise |
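
Several of the methods above differ only subtly, so a small worked example may help. The sketch below uses only the Python standard library and made-up values for a single group; it illustrates the definitions in the table rather than how the recipe computes them. It assumes that the bitwise methods fold their operator across all values in a group, as SQL-style bitwise aggregates typically do.

```python
import statistics
from functools import reduce
from operator import and_, or_, xor

values = [2, 4, 4, 6]  # hypothetical values of one column within one group

# Distinct variants drop duplicates before aggregating.
sum_all = sum(values)                         # 16
sum_distinct = sum(set(values))               # 12
avg_distinct = statistics.mean(set(values))   # 4

# Collect list keeps duplicates; collect set keeps each value once.
collect_list = list(values)
collect_set = set(values)

# Sample (unbiased) statistics divide by n - 1 rather than n.
var_samp = statistics.variance(values)
stddev_samp = statistics.stdev(values)

# Population covariance and the Pearson correlation of two columns.
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
covar_pop = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
corr = covar_pop / (statistics.pstdev(x) * statistics.pstdev(y))

# Assumption: bitwise aggregates fold the operator over every value in the group.
flags = [0b111, 0b101, 0b100]
bit_and = reduce(and_, flags)   # 0b100
bit_or = reduce(or_, flags)     # 0b111
bit_xor = reduce(xor, flags)    # 0b110

print(sum_all, sum_distinct, avg_distinct)
print(collect_list, sorted(collect_set))
print(bin(bit_and), bin(bit_or), bin(bit_xor))
```

Because `y` is an exact multiple of `x` in this example, the Pearson correlation comes out at (approximately) 1, the value for a perfectly linear positive relationship.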