Release Notes

Version 8.4.1

Management of Orphan Projects

Administrator can now choose to whom to assign an orphan project.

Fixes & improvements

Fixed a bug preventing the use of certain formulas, only when the application is deployed on a Spark cluster
When publishing an exposure in draft status, the user who created this exposure was not updated in the exposure history table.

Version 8.4.0

DataBlock export

You can define a path and filename dynamically from a mini language in order to create writing patterns when exporting a DatBlock.
This option is also available for export templates.

Inactive elements

Inactive elements are no longer visible in the lists of elements available to be consumed in another element. += Inactive elements are no longer visible in the lists of elements available to be consumed in another element. Consumed elements can no longer be deactivated.

Date management

It is now possible to configure, at the spark deployment level, an instance variable allowing the time zone to be positioned when aggregating on dates.

SMTP settings

The email sending settings (SMTP) can now be done during deployment via the configuration file and in this case can no longer be modified using the interface.

Corrections & improvements

Formulas and sorting: the formula label is now visible in the rule definition popup.
Centring zone: Items are now correctly aligned in the display interface.
DataBlock: correction of an anomaly in column valuation.

Version 8.3.2

Unit functions

New functions are available for processing data from DataBlocks and the data sources of HandleData’s visualisation elements.

They will replace similar functions in the next major release (9.0).

The new functions on constants of type dates available should be used instead of the deprecated functions.

Version 8.3.0

Changes to date management

A global upgrade has been made to Date (TimeStamp) management. This involves various changes to the interfaces and functions of GenericsData and HandleData.

All the information on handling Dates in DataChain is available in the Operations on data Dates menu.

Display of Date data (TimeStamp)

In order to simplify the reading and manipulation of Date type values throughout the value chain, they are now displayed by default with the simplified ISO 8601 format display mask. + The time zone depends on the settings of your DataChain instance (UTC by default). The time zone depends on the parameters of your DataChain instance (UTC by default), contact a DataChain administrator if in doubt.
When hovered, Date values are always displayed in full ISO 8601 format and in UTC, regardless of the display mask applied.

Simplified ISO in UTC: 2023-09-01 13:42:24
Full ISO in UTC: 2032-09-01T13:42:24:000Z

Masks, time zones and languages

The Masks used for reading, writing or displaying Dates can be modified and custom.

The xref:transverse/date.adoc#display-mask used by default for Dates type values can be modified if required, at different levels of the Value String :

at the DataBlock stage output
in the HandleData display Sources

When adding or configuring the Mapping of a Business Entity, or when converting to the Date type at the output of a DataBlock step, you can define an time zone and a language in addition to the reading mask.

Unit functions

2 new functions are available for processing data from DataBlocks and data sources in HandleData’s visualisation elements, and will soon replace 2 similar old functions.

New functions

date.to_strz transforms a date into a character string using a writing mask, a time zone and a language.
str.to_datez transforms a String into a Date using a read mask, a time zone and a language

Deprecated functions

The following functions are now deprecated in favour of the new functions and will be removed during 2024.

str.to_date: use str.to_datez instead. * date.to_str : use date.to_strz instead.

Dashboards

To ensure that the Dashboards display works correctly in "Caroussel" mode, you need to revalidate the settings for the centring zones on Date type values that contain the selector option.

Other changes

Dashboards: the maximum height of Dashboard containers has been doubled, allowing more elements to be positioned.

Version 8.2.0

Changes

Export templates

It is now possible to set reusable Export Templates for all Connectors available in DataChain (Local, Database, Remote Server).
You must have the global permission linked to Export Templates to be able to create and use Templates.

By default, this new global permission is inactive for all groups and users.
DataChain Administrators can modify permissions in the group and user management pages.

Use a Template when exporting
Export Templates can be loaded to automatically populate the export form.
Create a Template
Template management is available from the "Miscellaneous" menu.

Exports

The DataBlock export interface has been updated following the addition of the Export Templates +. New new options are now available.

Exhibitions

Adding variables to filters: It is now possible to limit access to data exposure for users or groups using new filter variables.

Unit functions

New functions are available for finding the position of an element in a list.

[type].list.position(ListArg1, Arg2) : returns the position of the *Arg2* value contained in ListArg1.

The index is based on 1
The formula returns -1 if the given value cannot be found in the list

Variants of the formula

str.list.position(ListStrArg1, ListStrArg2): search for a string value + Example: str.list.position(ListStrArg1, ListStrArg2): search for a string value Example : str.list.position(["moly", "peter", "marsh", "carl"], "marsh")
return 3
int.list.position(ListIntListArg1, IntArg2) : search for an integer value
dec.list.position(ListDecListArg1, DecArg2) : search for a decimal value
bigint.list.position(ListBigIntArg1, BigIntArg2) : search for a large integer value
date.list.position(DateListArg1,DateArg2) : search for a date value

Other changes

Exports: the Export history now only contains exports performed by the logged-in user.

Version 8.1.3

Other changes

Various corrections and improvements
- Dataframe Status Api return persistence status, even when persistence job is running.

Version 8.1.2

Other changes

Various corrections and improvements
- Project Expose Api return disabled or not already expose endpoints

Version 8.1.1

News

Aggregations: Addition of two new functions "Standard deviation" and "Variance".
Documentation: Updated documentation for the PowerBi connector.

Other changes

Various corrections and improvements
- The 8.1.0 version of keycloak delivered with Datachain no longer included token exchange functionality.

Version 8.1.0

News

Specification of JSON, XML and CSV file encoding: it is now possible to specify the encoding of Json and xml files at the level of S3 MinIo type repositories.
Authentication on the HTTP connector*: authentication settings added to HTTP-type repositories.
Hash algorithm change*: the hash algorithm used in the str.encrypt formula is sha256. Previous versions used the MD5 algorithm.
Authentication server*: upgrade to Keycloak version 21.1.
Power BI connector: A DataChain connector is now available as an option in PowerBI. It enables DataChain Datablocks exposed from the PowerBI interface to be consumed directly.
Documentation Datachain documentation is now available online at https://docs.adobisdc.fr. The links in the applications use this new site. This new documentation allows us to provide a document search module, as well as a first version in English (all the modules are not yet translated and will be at a later date).

Other changes

Correction of an anomaly in the reading of files based on S3/MinIo connectors.
The reading of very large Excel files has been optimised. There should no longer be a limit.
Various corrections and improvements
- Pattern management in S3 repositories
- Job websocket optimisation for high-volume requests.
- Security: Activation of Pkce authentication https://ordina-jworks.github.io/security/2019/08/22/Securing-Web-Applications-With-Keycloak.html
- Increased password security
- Correction of a malfunction when deleting formulas in Datablock steps.

Version 8.0

News

Version 8 brings a number of new features, including significant performance improvements to the application.

Modification of the deployment architecture

The standard 7.0 architecture is as follows:

Deployment architecture diagram in V7

A Redis component is now integrated in the datachain stack, allowing asynchronous communication between the backend component and the spark component.

V8 deployment architecture diagram

Integration of Spark FAIR mode

In a given Spark application (SparkContext instance), multiple parallel tasks can run concurrently if they have been submitted from separate threads. By "task" in this section we mean a Spark action (e.g. record, collect) and all the tasks that need to be executed to evaluate that action. Spark’s scheduler is fully thread-safe and supports this use case to enable applications that respond to multiple requests (e.g. requests for multiple users).

By default, the Spark scheduler runs tasks in FIFO mode (Datachain V7 case). Each task is divided into "stages" (e.g. map and reduce phases), and the first task has priority over all available resources while its stages have tasks to run, then the second task has priority, etc. If the jobs at the head of the queue do not need to use the whole cluster, subsequent jobs can start running immediately, but if the jobs at the head of the queue are large, subsequent jobs can be significantly delayed.

It is now possible to configure fair sharing between jobs in the new version of Datachain, by enabling spark’s FAIR mode. In fair share, Spark allocates jobs between jobs in a "round robin" fashion, so that all jobs get a roughly equal share of the cluster resources. This means that short jobs submitted while a long job is running can start receiving resources immediately and still get good response times, without waiting for the long job to finish. This is the mode that is now enabled by default in Datachain.

The job processing mode is now as follows:

A job is submitted to the backend, which will drop it into a Redis queue. It is handled by a Spark context, which notifies as the processing progresses (taken on, in processing, in error, finished, etc.). Redis, the queue management system used, is also used to store job execution results. Finally, an event system is also implemented to ensure communication between several deployed backends.

The impacts on configuration and deployment are as follows (only the new configuration items are listed): in the docker-compose deployment file: Adding the Redis service

  dc_redis_cache:
    image: redis:7.0-alpine
    command: redis-server --loglevel warning
    networks:
      dc_network:
        aliases:
          - dc-redis
    deploy:
      mode: replicated
      replicas: 1
      placement:
        constraints:
ADD DEPLOYMENT CONSTRAINTS (to be deployed with the backend image)

In the backend configuration (backend.env file) new parameter dc.app.notification.urls with usual value: redis://dc-redis:6379 Removed parameters: dc.app.api.spark (replaced by dc.app.context.url on the spark side)

JAVA_OPTIONS=-Xms1024M -Xmx4g -Duser.home=/opt/dev/work

## Now useless
## dc.app.api.spark=http://dc-spark1-1:8090,http://dc-spark1-2:8090,
## Simplified below
## dc.app.api.spark.security.user.name=backend,backend,backend,backend,backend,backend,backend,backend
dc.app.api.spark.security.user.name=backend

## Simplified below
## dc.app.api.spark.security.user.password=aaaa,aaaa
dc.app.api.spark.security.user.password=aaaa

## No longer required
## dc.api.spark.shared-file-system=true

## new
# Redis configuration
dc.app.notification.urls=redis://dc-redis:6379

In the spark configuration (spark.env file)

new parameter (same as backend)
dc.app.notification.urls of usual value: redis://dc-redis:6379
new parameters (for more information, see online datachain documentation)
dc.app.job.manager.spark.thread.number=5 and dc.app.job.manager.other.thread.number=5
new parameters (for more information, see online datachain documentation)
dc.app.job.manager.spark.thread.number=5 and dc.app.job.manager.other.thread.number=5
new parameter: dc.app.context.url=http://dc-spark1:8090
Remove parameters (no more http calls from spark to backend)
dc.app.api.backend, spring.security.user.name, spring.security.user.password

## No longer needed
## dc.app.api.backend=http://dc-backend:8089

# Spark context setup
dc.app.context.id=0
## NEW PARAMETER
dc.app.context.name=context_0
## NEW PARAMETER
dc.app.context.url=http://dc-spark1:8090
dc.app.spark.app-name=context1_1

# Configuration spark
## NEW PARAMETER
dc.app.spark.scheduler.mode=FAIR
## NEW PARAMETER
dc.app.job.manager.spark.thread.number=5
## NEW PARAMETER
dc.app.job.manager.other.thread.number=3

## NOW USELESS
# dc.keycloak.auth-server-url=http://dc-keycloak:8080/auth

## NEW PARAMETER
# Configuration notifications
dc.app.notification.urls=redis://dc-redis:6379

Security

Security adjustments have been made to the application, including

some URLS could still carry tokens, this is no longer the case
the backend and spark services could be launched as root. This is no longer natively the case: the servers are now deployed with a dc user with uid 38330 and a datachain group with a GID 38330. Please note that the rights on the directories must be modified on the HDFS trees (in the case of a local deployment or hadoop cluster), to give access to this user who no longer has root privileges

Other news

Dashboards - HandleData

Centering zones

When setting up a centring zone of type Date , it is now possible to set the type of date that should be used when entering values. Two formats are available: simple date or date with time

Setting the presence of a scroll
Setting the presence of a vertical or horizontal scroll for each element present in a dashboard.

Dashboard publication

It is now possible to delete an expired publication.
In previous versions, it was impossible to delete a dashboard if it contained an expired publication.
In order to delete a dashboard containing an expired publication, you must first delete all expired or archived publications from the Publication tab of the dashboard.

HandleData and Generics Data

Management of formulas

A better handling of null values during the processing of some formulas has been done. The formulas impacted are in particular the formulas str.if, int.if, dec.if, bigint.if, str.compare.

Aggregation management

In previous versions, when creating an aggregation, the solution automatically generated a sort in the order of the columns positioned in the aggregation. Since the DataChain 8.0 version, this automatism does not exist anymore.
If needed, the user can set the sorting he wants or not.

Version 7.6.0

News

Deleting a Project from a DataChain Instance: it is now possible to completely delete a Project from a DataChain instance.
A Project can now be deleted from the Project card.
Only Project Administrators can delete a Project. An internet connection is required to validate the action of deleting a Project.
An email containing a validation token is sent to the email address of the person carrying out the deletion. A Project consumed by another Project cannot be deleted.

Note that deleting a Project is definitive. It is advisable to export the Project before deleting it.

Deleting a Project deletes all the elements of a Project, i.e.

Publications
Dashboards
All views (Media, Tables, Charts, TimeLine, Map)
Visualisation sources
Datablocks and Datablock API Exposures (EndPoint and cache)
Business Entities
Repositories and Extractions
Connectors
Files contained in local repositories
All persistence of Business Entities and Datablocks
Filings created in the deleted Project

Neo4j Connector added: A new Neo4J Connector has been added (graph database).

Neo4j Repository added: A new Repository allowing the entry of a Cypher query based on a Neo4J Connector has been added.

Repositories / Download: It is now possible to download files contained in local repositories.

File Explorer: A file explorer has been added for exploring remote content. This explorer is available for HDFS, MinIO, S3 and SFTP Connectors.

The explorer function is available in the Connectors and Repositories functions and when setting up Exports. This function allows you to view remote files. For the Repositories and Exports functions, the explorer lets you choose the target path. Only directories can be selected.

Deleting Dashboard Publications: It is now possible to delete one or more Publications from a Dashboard.

A Dashboard cannot be deleted as long as it contains active and/or archived Publications. To delete a Dashboard, all the related Publications must be deleted.

To delete the Publications, access the Dashboard in Edit mode and then use the Publications tab.

There are several cases to consider when deleting a Publication

The Publication is active but has never been consulted: in this case, following confirmation of deletion, deletion is definitive and immediate.
The Publication is active and has already been consulted: in this case, following confirmation, the Publication changes status and becomes an archived Publication.

Addition of a URI parser for HTTP/HTTPS Connector Repositories: Addition of a URI parser, allowing more structured input of URI parameters. This function is only available for Http/Https Connector Repositories.

AngularJs version update and integration of the Angular stack: Since version 7.6 and with a view to future migrations of certain screens or new developments, the Angular stack has been integrated.

🚑 Corrections

When exposing, when multiple filters apply for a user, they are now cumulated via a logical "or" instead of an "and".
API Exposure - Dates exposed by the api were exposed in the wrong format. They are now in ISO 8601 format.

Version 7.5.4

🚑 Fixes

Improved support for accented files on the SFTP Connector.
Corrections to encoding management on HTTP Connector streams (detection and support).
Allowed the same endpoints to be used for Exhibitions when importing a Project
Use of browser language in Publications
Improved support for excel files

Version 7.5.0

Changes

Project duplication: A Project can now be duplicated within the same DataChain instance.

Export/Import Project: A Project can be exported from one DataChain instance, and then imported into another instance. This import/export can involve all or some of the elements of the Project. Project export traceability : A new screen allows you to view the traces of the instance’s Project exports. This screen can be accessed from GenericsData.

From the left-hand menu, click on "Traceability of Exports".
From the screen, two tabs are available * Items : Monitoring DataBlocks exports * Projects: Project export tracking ** From each trace line, it is possible to download the file generated on the date of the export action.

🚑 Corrections

Batch rights management for members of a Group type Project from the administration screen.
Fixed a problem displaying join target criteria when editing a join from DataBlock steps.
Various minor fixes or improvements to the new Project Duplication and Project Export/Import functions

Version 7.4.2

🚑 Fixes

Fixed a blocking anomaly on the registration of DataBlock created before version 7.4.0.

Version 7.4.1

Changes

DataBlock: CSV export function improved: New parameters are available for CSV exports.

🚑 Fixes and improvements

In the Datablock step output, the null replacement function has been improved for decimal columns. Support for NaN values is now treated in the same way as null values.
Statistics on Date columns. Following the Spark 3 migration, an error when producing statistics for Date columns has been corrected.
Improvements to the Project Home page. Addition of a filter on the action type column.
Correction of a blocking problem preventing a Repository password from being changed, following the installation of the additional encryption key.

Version 7.4.0

Changes

Migration from Spark 2.4. to Spark 3.03

Spark 3 integration: DataChain now integrates the Spark 3 engine (version 3.0.3) in place of Spark 2.4. +. For more information on the developments and changes associated with the release of Spark 3, see here: https://Spark.apache.org/docs/3.0.3/sql-migration-guide.html

Performance gains have been observed on our platform in single-node or cluster mode. All images now use Java version 11.

Spark 3 cluster*: We are making available a Spark 3 cluster and a hadoop cluster, based on the same model as the images previously offered ("bde2020" hadoop and Spark images).
A new repository is available on our registry harbor (with a dedicated key, in order to make these images available).

The repository is "ado-spark-hadoop" and the images are in version 1.0.0.

b9kd47jn.gra7.container-registry.ovh.net/ado-spark-hadoop/master-spark-3.0.3-hadoop-3.2:1.0.0
b9kd47jn.gra7.container-registry.ovh.net/ado-spark-hadoop/worker-spark-3.0.3-hadoop-3.2:1.0.0
b9kd47jn.gra7.container-registry.ovh.net/ado-spark-hadoop/namenode-hadoop-3.3.2:1.0.0
b9kd47jn.gra7.container-registry.ovh.net/ado-spark-hadoop/datanode-hadoop-3.3.2:1.0.0

The images replace the images used previously

bde2020/spark-master:2.4.5-hadoop2.7
bde2020/spark-worker:2.4.5-hadoop2.7
bde2020/hadoop-namenode:2.0.0-hadoop2.7.4-java8
bde2020/hadoop-datanode:2.0.0-hadoop2.7.4-java8

NaN (Not a Number) value handling: NaN value handling is now supported for decimal columns.

QueryBuilder (Rules and filters manager): In the QueryBuilder, a new operator allows you to filter (or retain) NaN values.

Formula manager* : A new formula is available to transform NaN values into Null values.

Formula name: dec.nan_to_null(Argument 1)
Description: Transforms NaN values into null values
Argument 1: Column of type Decimal

DataBlock: Step output : The Null value replacement function has been extended.
This function now takes NaN values into account in the same way as Null values.
Column statistics now take NaN values into account.

Version 7.3.2

Changes

Security correction following an audit.

Cookie management: The DataChain application no longer generates cookies. Only the KeyCloak cookies remain, as they are inherent to the product itself +. In order to set these cookies to Secure, the realm configuration must be set to Require SSL: all request (modification to be made in the dc-realm realm, login tab).
Session cookies cannot be marked as http_only, for reasons relating to the functionality of the KeyCloack product (see https://lists.jboss.org/pipermail/keycloak-user/2017-September/011882.html and https://issues.redhat.com/browse/KEYCLOAK-12880).
Database password outsourcing* : The database password can be overridden via environment variables in the Docker compose of DataChain.
For all postgresql images, the key is POSTGRES_PASSWORD.
The PG_MIGRATION image must have the appropriate environment variables to function during dumps

PG_PASSWORD, PG_KC_PASSWORD and PG_EXPOSE_PASSWORD

Enhanced encryption of passwords in the database :

Until now, passwords have been encrypted in the database. However, the encryption key is unique for all environments. It is now possible to override the existing encryption key (new application parameter (16 characters required) to be declared for the application’s backend image and Spark image: dc.app.backend.cipher.key). This parameter is concatenated with the application’s internal encryption key.

Please note that if this parameter is used, previously configured passwords will have to be modified via the application interface, in order to be usable by the application. This parameter should therefore be implemented when the application is first installed, and not modified thereafter.

Script for detecting type mismatches in DataBlocks following the 7.3.0 upgrade.

This script analyses all the unions and joins present in the DataBlocks stages of a DataChain instance. The analysis highlights type inconsistencies between the source and target columns of the join and/or union criteria. A report shows the DataBlocks where at least one column pair is inconsistent. The DataBlocks in question must be re-mapped for the columns concerned.

The script is run using the following command on the server where the DataChain instance is located

Docker run -v /tmp:/scripts_outputs -it --network_DataChain_network_name -e PG_HOST=dc-pg --entrypoint bash dc/pg_migration:7.3.2 utils.sh detectUnionsJoinsImpact.groovy

Items in bold are to be adapted to the InstanceDataChain dataChain_network_name: found in the DataChain’s Docker compose file (e.g. dc_network) dc/pg_migration:7.3.2 : name of the Docker image to be used, prefixed by the Docker repository used /tmp: name of the directory in which the report file will be saved once the script has been run.

It is possible that the current Docker network does not support the execution of an external container (execution generates an error indicating that it is impossible to connect to the network). In this case edit the Docker-compose.yml file find the network description at the end of the file and the DataChain network