Aggregation Recipe

The Aggregation recipe groups the rows of an input data set and computes summaries, statistics, or other aggregations for each group. It is particularly useful for the large-scale processing and analytics tasks encountered in big data environments.

Configuration

Configuration	Description
Recipe Name	A freeform name for the recipe, chosen by the user
Input	Select a previously constructed recipe to process
Grouping Columns	Choose the columns from the input data that define the groups
Aggregations	Select the columns to aggregate and the aggregation method to apply to each
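
To illustrate how Grouping Columns and Aggregations work together, the following is a minimal Python sketch of a group-then-aggregate step. It is not the recipe's implementation: the `aggregate` helper, the sample rows, and the column names (`region`, `amount`) are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical input rows; the column names are illustrative only.
rows = [
    {"region": "east", "product": "a", "amount": 10},
    {"region": "east", "product": "b", "amount": 30},
    {"region": "west", "product": "a", "amount": 5},
    {"region": "east", "product": "a", "amount": 20},
]


def aggregate(rows, grouping_columns, aggregations):
    """Group rows by grouping_columns, then apply each (column, function) pair per group."""
    groups = defaultdict(list)
    for row in rows:
        # Rows that share the same values in the grouping columns form one group.
        key = tuple(row[column] for column in grouping_columns)
        groups[key].append(row)

    result = []
    for key, members in groups.items():
        out = dict(zip(grouping_columns, key))
        for output_name, (column, func) in aggregations.items():
            out[output_name] = func([member[column] for member in members])
        result.append(out)
    return result


summary = aggregate(
    rows,
    grouping_columns=["region"],
    aggregations={
        "amount_count": ("amount", len),  # analogous to COUNT
        "amount_sum": ("amount", sum),    # analogous to SUM
        "amount_max": ("amount", max),    # analogous to MAX
    },
)

for line in summary:
    print(line)
```

Each output row holds one combination of grouping-column values plus one value per configured aggregation.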

Aggregation Methods

Method	Shorthand	Description
Count	COUNT	Returns the count of values for each column for each group
Sum	SUM	Returns the sum of values for each column for each group
Average	AVG	Returns the average of values for each column for each group
Min	MIN	Returns the minimum value for each column for each group
Max	MAX	Returns the maximum value for each column for each group
First	FIRST	Returns the first value for each group
Last	LAST	Returns the last value for each group
Sum distinct	SUM DISTINCT	Returns the sum of the distinct values for each group
Avg distinct	AVERAGE DISTINCT	Returns the average of the distinct values for each group
Collect list	COLLECT_LIST	Returns all values in a group as a list, including duplicates
Collect set	COLLECT_SET	Returns the set of distinct values in a group
Standard deviation	STDDV	Alias for Standard deviation sample
Standard deviation sample	STDDV_SMP	Returns the unbiased sample standard deviation of the values in a group
Variance	VARIANCE	Alias for var_samp
Variance Sample	VARIANCE_SAMP	Returns the unbiased sample variance of the values in a group
Skewness	SKEWNESS	Returns the skewness of the values in a group: an indication of whether data points are spread out more on one side of the mean than the other.
Kurtosis	KURTOSIS	Returns the kurtosis of the values in a group, a statistical measure of the shape of their distribution. Kurtosis emphasizes the “tailedness” of the distribution: how heavy or light its tails are compared to those of a normal distribution.
Percentile Approx	PERCENTILE_APPROX	Returns the approximate percentile of a numeric column, where the percentile is given as a value between 0.0 and 1.0. The approximation trades accuracy for performance. For exact percentiles, use the percentile aggregator.
Correlation	CORR	Calculates the correlation between two columns and returns the result as a double. Currently, only the Pearson correlation coefficient is supported.
Covariance Population	COVAR_POP	Returns the population covariance between two columns. Covariance is a measure of how much two random variables vary together.
Bit and	BIT_AND	Returns the bitwise AND of all values in a group. Each bit of the result is 1 if the corresponding bit is 1 in every value, and 0 otherwise.
Bit or	BIT_OR	Returns the bitwise OR of all values in a group. Each bit of the result is 1 if the corresponding bit is 1 in at least one value, and 0 otherwise.
Bit xor	BIT_XOR	Returns the bitwise XOR (exclusive OR) of all values in a group. Each bit of the result is 1 if the corresponding bit is 1 in an odd number of values, and 0 otherwise. For two values, this means exactly one of the two bits is 1.
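
Several of the methods above differ only in subtle ways (distinct versus non-distinct, sample versus population, list versus set). The Python sketch below works through those differences on small hypothetical values using only the standard library. It assumes the methods follow their usual mathematical definitions; it does not call the recipe itself, and all variable names are illustrative.

```python
import math
import statistics
from functools import reduce
from operator import and_, or_, xor

# Hypothetical values of one column within a single group.
values = [2, 4, 4, 6]

# Sum vs. Sum distinct, Average vs. Avg distinct: duplicates are dropped first.
total = sum(values)                             # 2 + 4 + 4 + 6 = 16
total_distinct = sum(set(values))               # 2 + 4 + 6 = 12
average_distinct = statistics.mean(set(values)) # (2 + 4 + 6) / 3 = 4

# Collect list keeps duplicates; Collect set does not.
collected_list = list(values)
collected_set = set(values)

# Sample variance divides by n - 1 (unbiased); population variance divides by n.
sample_variance = statistics.variance(values)       # 8 / 3
population_variance = statistics.pvariance(values)  # 8 / 4 = 2
sample_stddev = statistics.stdev(values)            # sqrt(8 / 3)

# Bitwise aggregations fold the operator over every value in the group.
flags = [0b0111, 0b0101, 0b0100]
bit_and = reduce(and_, flags)  # 0b0100: bits set in every value
bit_or = reduce(or_, flags)    # 0b0111: bits set in at least one value
bit_xor = reduce(xor, flags)   # 0b0110: bits set in an odd number of values

# Population covariance and Pearson correlation of two hypothetical columns.
x = [1, 2, 3, 4]
y = [2, 4, 6, 8]
mean_x, mean_y = statistics.mean(x), statistics.mean(y)
cov_pop = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / len(x)
pearson = cov_pop / (statistics.pstdev(x) * statistics.pstdev(y))

print(total, total_distinct, sample_variance, population_variance)
print(bin(bit_and), bin(bit_or), bin(bit_xor))
print(cov_pop, pearson)
```

Because `y` is an exact multiple of `x` in this example, the Pearson correlation comes out at (approximately) 1, the maximum possible value.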
