Unions (Set Operations)

Table of Contents

1. General Information
2. Principles
3. Types of Operations

1. General Information

In each step of a DataBlock , you can apply a Union to combine two datasets – whether they originate from Entities or other DataBlocks.

DataBlocks Documentation - see Step Creation Chapter

This mechanism allows merging, intersecting, or comparing homogeneous sets for analysis, filtering, or restructuring of data flows.

Examples:

Merge setA and setB to obtain all their rows.
Identify common rows or, conversely, different rows between two sets.

The combined columns must have the same number and the same types (e.g., impossible to merge a text column with a numeric column). SetB cannot have more columns than setA.

The feature allowing identification of different rows between two sets is only available since version 8.9.0.

2. Principles

Configuring a Union involves:

Selecting the target DataBlocks/Entities
Choosing the operation type (see next section).
Checking the schema compatibility (types & column order).

The results are injected into the current DataBlock, column by column, in the defined order.

3. Types of Operations

The subsections each describe the four supported operations, illustrated using the two example DataBlocks:

DataBlockA: “Patients”

No	First Name	Last Name	Age
001	Harry	Potter	16
002	John	Wick	42
002	John	Wick	42
003	James	Bond	45

DataBlockB: “NewPatients”

No	First Name	Last Name	Age
001	Harry	Potter	16
004	Emmett	Brown	63
005	Geralt	DeRiv	null

3.3. UNION (UNIONALL)

Purpose: Merge A and B by stacking all rows, without automatic deduplication.

No	First Name	Last Name	Age
001	Harry	Potter	16
002	John	Wick	42
003	James	Bond	45
004	Emmett	Brown	63
005	Geralt	DeRiv	null

Note: equivalent to SQL UNIONALL. Potential duplicates are preserved.

3.4. INTERSECTION

Purpose: Retain only rows present in both A and B (comparison based on all columns).

First Name

Last Name

Age

001

Harry

Potter

Behavior: Only one instance of each common row is returned (even if duplicated in the sources).

3.5. SUBTRACTION (EXCEPT)

Purpose: Return rows from DataBlockA that are absent from DataBlockB or differ by at least one value.

No	First Name	Last Name	Age
002	John	Wick	42
003	James	Bond	45

Duplicates are removed in the result, consistent with Spark behavior for the except function.

3.6. SUBTRACTION WITH DUPLICATES (EXCEPTALL)

Purpose: Same logic as Subtraction, but preserves the exact number of occurrences remaining after comparison.

No	First Name	Last Name	Age
002	John	Wick	42
002	John	Wick	42
003	James	Bond	45