Repositories


The Repository function of DataChain is accessible from the GenericsData module.

The number of repositories that can be created is not limited.

A repository defines how data is read from a connector (Local, DB, or other).
Each connector must be linked to at least one repository.
In the case of rapid integration, only a repository is created: you will not find any "Local" connector.

The repository represents Level 2 of the DataChain value chain.

Value chain

This function is essential for data consumption in DataChain.

The repository is always associated with a connector.

The connector defines how data is read.

Depending on the connector, the available reader types vary.

A repository feeds one or more Business Entities.

Creating a Repository

List of Existing Repositories

A repository is created from the GenericsData module.

  • Access the GenericsData module.

  • In the GenericsData left menu, choose the Repositories option, associated with the boxes.svg icon.

Repository list

Metadata

  • Click on the Add button.

  • Each repository has a metadata panel. Entering a label is required.

    Optional input fields allow you to provide additional information. An icon can also be assigned to the repository via the commands in its metadata panel.

information It is advisable to save this panel as soon as it is filled in. Use the Save button located in the right part of the top banner of the screen.

Choice of a connector


Two main types of connector are available in DataChain:

Local connector (or connectorless mode)

DataChain embeds a connector in its base deployment. It allows data to be integrated without the need to create a connector.

Warning Note that when using a local connector, the data will be physically stored in the DataChain context.

To use the local connector, click on the Without connector option (1).

External Connector

To use the external repository mode, click on the corresponding option button.

The external repository mode requires an already existing DataChain connector to be specified. To choose a connector, use the choice box.

The list proposes all authorized connectors.

Warning Note that when using an external connector, the data will be physically stored outside the DataChain context.

As a reminder, here are the types of connectors present in DataChain:

  • Local: without configuration, native and not accessible from connector management.

  • SFTP

  • HTTP

  • HTTPS

  • S3 (AWS)

  • SQL and NoSQL databases

  • HDFS

  • ElasticSearch

  • …​

information Depending on the connector chosen, the repository settings may vary.

Repository Types - Setup

Connectors: Local Connector
File with separator
  • Text Identifier: Specifies the character that is used as an escape character.

  • Separator: Specifies the character that is used as the separator character.

  • Encoding: Indicates the character encoding used by the file to be processed, in order to take special characters into account. It is set to UTF-8 by default. This value can be changed.

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Reading modes: 3 possible modes

    • PERMISSIVE: Scans all rows: NULL values are inserted in place of missing values and extra values are ignored.

    • DROPMALFORMED: Drops rows with fewer or more values than expected or values that don’t match the pattern.

    • FAILFAST: Aborts with a RuntimeException if a malformed row is encountered

  • Header: Indicates whether the first line contains the column headers.

  • Multilines: Option used to manage the case of files containing line breaks in a column.
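The three reading modes above can be illustrated with a plain-Python sketch. This is an assumption-level illustration of the documented behavior, not DataChain's internal code:

```python
# Illustrative sketch of the three reading modes described above.

def read_rows(raw_rows, expected_cols, mode="PERMISSIVE"):
    result = []
    for row in raw_rows:
        if len(row) == expected_cols:
            result.append(row)
        elif mode == "PERMISSIVE":
            # Pad missing values with None, ignore extra values.
            result.append((row + [None] * expected_cols)[:expected_cols])
        elif mode == "DROPMALFORMED":
            continue  # silently drop malformed rows
        elif mode == "FAILFAST":
            raise RuntimeError(f"Malformed row: {row!r}")
    return result

rows = [["a", "1"], ["b"], ["c", "3", "extra"]]
print(read_rows(rows, 2, "PERMISSIVE"))     # [['a', '1'], ['b', None], ['c', '3']]
print(read_rows(rows, 2, "DROPMALFORMED"))  # [['a', '1']]
```

With FAILFAST, the same input raises a RuntimeException-style error as soon as the malformed second row is encountered.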

Parquet file
  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

Json file
  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Encoding: Indicates the computer character format used by the file to be processed in order to take special characters into account. It is automatically detected, by default.

  • Multilines: Indicates whether the JSON file contains several JSON strings (Multilines set to YES) or a single JSON structure

  • Json Path: Determines the level for header detection

  • Explode(s): Indicate if one (or more) of explode operations must be performed at the JsonPath level (1 by default)
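The Json Path and Explode(s) settings can be pictured as follows: the path points at a list inside the source document, and one explode produces one output row per element of that list. A hypothetical sketch of the concept, not DataChain's actual reader:

```python
import json

# Hypothetical illustration of "Json Path" + "Explode".
doc = json.loads('{"order": {"id": 7, "items": [{"sku": "A"}, {"sku": "B"}]}}')

def explode(record, json_path):
    node = record
    for key in json_path.split("."):  # walk down to the Json Path level
        node = node[key]
    return list(node)                 # one row per list element

rows = explode(doc, "order.items")
print(rows)  # [{'sku': 'A'}, {'sku': 'B'}]
```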

XML file
  • New Line Tag: The line tag of your xml files to be treated as a line.

  • Reading modes: 3 possible modes

    • PERMISSIVE: Scans all rows: NULL values are inserted in place of missing values and extra values are ignored.

    • DROPMALFORMED: Drops rows with fewer or more values than expected or values that don’t match the pattern.

    • FAILFAST: Aborts with a RuntimeException if a malformed row is encountered

  • Encoding: Indicates the computer character format (encoding) used by the file to be processed in order to take special characters into account. It is positioned by default in UTF-8. It is possible to change this value.

  • Ignore spaces before or after data: Indicates whether white spaces around read values should be ignored. The default is No

  • Treat empty values as null values: Indicates whether the space character should be treated as a null value. The default is No

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.
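The New Line Tag setting can be illustrated with the standard library: every element matching the tag is read as one row. A hypothetical sketch, not the actual XML reader:

```python
import xml.etree.ElementTree as ET

# Illustration of the "New Line Tag" setting: each matching element is a row.
xml_data = """<library>
  <book><title>Dune</title><year>1965</year></book>
  <book><title>Solaris</title><year>1961</year></book>
</library>"""

row_tag = "book"  # the "New Line Tag" value
rows = [{child.tag: child.text for child in elem}
        for elem in ET.fromstring(xml_data).iter(row_tag)]
print(rows)
# [{'title': 'Dune', 'year': '1965'}, {'title': 'Solaris', 'year': '1961'}]
```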

Excel
  • The address of the data: Indicates the sheet and the cell range of the Excel file to be read. Example: My Sheet!A1:K225

  • Workbook password: If the Excel file is protected by Password, it is mandatory to specify it in this input area.

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

Warning Headers of numeric-type columns (which may come from Excel formulas) are not accepted and generate an error during integration.
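The data-address format above ("My Sheet!A1:K225") combines a sheet name and a cell range. A hypothetical parser, purely to show what each part of the address designates:

```python
import re

# Hypothetical parser for the "Sheet!A1:K225" data-address format.
def parse_address(address):
    sheet, cell_range = address.split("!")
    first_cell, last_cell = cell_range.split(":")
    # Each cell is a column letter group followed by a row number.
    assert re.fullmatch(r"[A-Z]+\d+", first_cell)
    assert re.fullmatch(r"[A-Z]+\d+", last_cell)
    return {"sheet": sheet, "first_cell": first_cell, "last_cell": last_cell}

print(parse_address("My Sheet!A1:K225"))
# {'sheet': 'My Sheet', 'first_cell': 'A1', 'last_cell': 'K225'}
```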

Binary
  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

Connectors: SFTP, HDFS and S3
File with separator
  • Text Identifier: Specifies the character that is used as an escape character.

  • Separator: Specifies the character that is used as the separator character.

  • Encoding: Indicates the computer character format used by the file to be processed in order to take special characters into account. It is automatically detected, by default.

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Reading modes: 3 possible modes

    • PERMISSIVE: Analyzes all rows: NULL values are inserted in place of missing values and extra values are ignored.

    • DROPMALFORMED: Drops rows with fewer or more values than expected or values that don’t match the pattern.

    • FAILFAST: Aborts with a RuntimeException if a malformed row is encountered

  • Header: Indicates whether the first line contains the column headers.

  • Multilines: Option used to manage the case of files containing line breaks in a column.

  • Path: Indicates the location of the files to be processed.

Parquet file
  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Path: Indicates the location of the files to be processed.

Json file
  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Encoding: Used to indicate the computer character format used by the file to be processed in order to take special characters into account. It is automatically detected, by default.

  • Multilines: Indicates whether the JSON file contains several JSON strings (Multilines set to YES) or a single JSON structure

  • Path: Indicates the location of the files to be processed.

  • Json Path: Indicates the level for header detection

  • Explode(s): Indicates whether one or more explode operations must be performed at the JsonPath level (1 by default)

XML file
  • New Line Tag: The line tag of your xml files to be treated as a line.

  • Reading modes: 3 possible modes

    • PERMISSIVE: Scans all rows: NULL values are inserted in place of missing values and extra values are ignored.

    • DROPMALFORMED: Drops rows with fewer or more values than expected or values that don’t match the pattern.

    • FAILFAST: Aborts with a RuntimeException if a malformed row is encountered

  • Encoding: Indicates the computer character format (encoding) used by the file to be processed in order to take special characters into account. It is positioned by default in UTF-8. It is possible to change this value.

  • Ignore spaces before or after data: Indicates whether white spaces around read values should be ignored. The default is No

  • Treat empty values as null values: Indicates whether the space character should be treated as a null value. The default is No

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Path: Indicates the location of the files to be processed.

Excel
  • The address of the data: Indicates the sheet and the cell range of the Excel file to be read. Example: My Sheet!A1:K225

  • Workbook password: If the Excel file is protected by Password, it is mandatory to specify it in this input area.

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

Warning Headers of numeric-type columns (which may come from Excel formulas) are not accepted and generate an error during integration.

  • Path: Indicates the location of the files to be processed.

Binary
  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Path: Allows you to define the location of the files to be processed.

Connectors: Http / Https / REST
File with separator
  • Text Identifier: Specifies the character that is used as an escape character.

  • Separator: Allows you to define the character that is used as the separator character.

  • Encoding: Indicates the computer character format used by the file to be processed in order to take special characters into account. It is automatically detected by default.

  • Reading modes: 3 possible modes

    • PERMISSIVE: Attempts to parse all rows: NULL values are inserted in place of missing values and extra values are ignored.

    • DROPMALFORMED: Deletes rows containing less or more values than expected or values that do not match the pattern.

    • FAILFAST: Aborts with a RuntimeException if a malformed row is encountered

  • Header: Indicates whether the first line contains the column headers.

  • Multilines: Option used to manage the case of files containing line breaks in a column.

  • Method: Method to apply GET or POST

  • URI: Specifies the URL that will be consumed by the Http / Https connector. Use the magnifying glass located at the end of the line to achieve a more structured entry of the URI using a URI Parser function.

  • Header: Allows key-value pairs to be added to the request header

  • Body: For the POST method, allows you to specify the Body

Parquet file
  • Method: Method to apply GET or POST

  • URI: Indicates the URL that will be consumed by the Http / Https connector

  • Header: Allows key-value pairs to be added to the request header

  • Body: For the POST method, allows you to specify the Body

Json file
  • Encoding: Indicates the computer character format used by the file to be processed in order to take special characters into account. It is automatically detected by default.

  • Method: Method to apply GET or POST

  • URI: Indicates the URL that will be consumed by the Http / Https connector

  • Header: Allows key-value pairs to be added to the request header

  • Body: For the POST method, allows you to specify the Body

  • Multilines: Indicates whether the JSON file contains several JSON strings (Multilines set to YES) or a single JSON structure

  • Json Path: Indicates the level for header detection

  • Explode(s): Indicates whether one or more explode operations must be performed at the JsonPath level (1 by default)

JSON V2 File

The "JSON V2" reader differs from the "JSON" reader in that it allows users to freely select elements from the source data to define repository headers.

Therefore, the interface includes all configuration fields from the "JSON" reader except Json Path and Explode(s). This reader also provides enhanced integration capabilities for a wider variety of file formats inspired by JSON syntax, with options that allow finer control over selecting and configuring repository headers extracted from the source data.

Integration of JSONL files (JSON Lines)

The JSON Lines format is a variant of JSON designed to efficiently store and process large volumes of structured data, particularly in streaming or logging contexts.

In this format, each line contains a complete and valid JSON object. Example (automated customer support application):

{"ticket_id": "12345", "text": "I can't log into my account.", "category": "authentication"}
{"ticket_id": "12348", "text": "The website has been slow since this morning.", "category": "performance"}
{"ticket_id": "12349", "text": "I want to cancel my subscription.", "category": "cancellation"}

To ensure proper integration of this file type, simply enable the JSON line option in the "Reading Settings" section of the repository configuration interface. Repositories with JSON Lines source files without this option enabled will return only the first line of data.
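The parsing behavior the JSON line option enables can be shown with plain Python: one complete JSON object per line of the file. Illustration only, using the support-ticket example above:

```python
import json

# One complete JSON object per line, as in the JSON Lines format.
jsonl_data = """{"ticket_id": "12345", "category": "authentication"}
{"ticket_id": "12348", "category": "performance"}
{"ticket_id": "12349", "category": "cancellation"}"""

records = [json.loads(line) for line in jsonl_data.splitlines() if line.strip()]
print(len(records))            # 3 rows, one per line
print(records[0]["category"])  # authentication
```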

Managing Header Types

A Type Detection option, enabled by default, allows the data types of future repository headers to be detected automatically. Disabling it assigns the String type to all repository headers.

In the new "Header Selection" modal, accessed via the Select Headers button, the "Convert to string" button to the right of each JSON source schema attribute allows structured attributes, list attributes, and other primitive types (dates, integers, decimals) to be interpreted as strings.

This feature is particularly useful for preserving the original format of numeric-looking identifiers that are not numbers, such as postal codes or SIRET numbers, both of which may begin with 0 (e.g. the Aisne department, 02).
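The leading-zero problem is easy to demonstrate: a numeric-looking identifier loses its leading zero as soon as it is typed as a number. Illustration only:

```python
# Why "Convert to string" matters for identifiers such as postal codes.
postal_code = "02000"         # Aisne, kept as a string
as_number = int(postal_code)  # what automatic type detection would yield
print(as_number)              # 2000 -- the leading zero is gone
print(str(as_number) == postal_code)  # False
```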

In the repository header selection modal, array-type elements (denoted by the symbol [ ]) containing child elements of structure type and structure-type elements (denoted by the symbol { }) are systematically converted into strings.

Integrating GeoJSON Files

The JSON V2 reader simplifies handling GeoJSON files, which differ from traditional JSON by their flexible data structure. Specifically, their geometry field includes coordinates with data structures varying according to the geometry type.

As specified in its standard RFC 7946, GeoJSON may contain (among other geometry types) coordinates stored as:

  • a simple list, for POINT type geometry. Example: [100.0, 0.0]

  • a list of lists, for LINESTRING type geometry. Example: [[100.0, 0.0], [101.0, 1.0]]

  • a list of lists of lists, for POLYGON type geometry. Example: [[[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]]]

The header selection modal manages this flexibility in the data schema definition by allowing users to retrieve the entire geometry element from the source as a string. This avoids applying a single schema for reading coordinates to the varying structure of this GeoJSON field and prevents errors in extracted source data.
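Because the coordinate nesting depth varies with the geometry type, no single column schema fits every feature; serializing each geometry to a raw string (what the "Pushpin" retrieval does) sidesteps the problem. A sketch of the idea, not the reader's actual code:

```python
import json

# GeoJSON geometries with coordinates of varying nesting depth.
geometries = [
    {"type": "Point", "coordinates": [100.0, 0.0]},
    {"type": "LineString", "coordinates": [[100.0, 0.0], [101.0, 1.0]]},
]

# One schema-free string per geometry, whatever its coordinate structure.
raw = [json.dumps(g) for g in geometries]
print(raw[0])  # {"type": "Point", "coordinates": [100.0, 0.0]}
```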

The button illustrated by the "Pushpin" icon enables retrieving this raw string data of an entire structure type element from the data schema. In the example below, the user clicked this button to retrieve the raw description of the geometry field from the GeoJSON source, regardless of its type or the data structure of its coordinates:

schema detection

Available actions from this window via buttons described by their tooltips:

Show all elements: expand the tree view to display all child elements

Collapse schema: hide all child elements beyond the first level

Select all (leaf elements): automatically check all final elements without sub-elements or arrays

Deselect all (leaf elements): uncheck all final elements without sub-elements or arrays

Rename duplicates: automatically rename identical labels to differentiate them

Reset labels: restore original labels for all schema elements

Generate automatic labels: assign names based on the hierarchical structure of elements

Recalculate schema: restart automatic detection of the data schema from the repository

Merge sub-elements: combine sub-elements into a single string of raw data

Force text conversion: convert specific elements to strings (required for arrays containing structures and structures themselves)

XML file
  • New Line Tag: The line tag of your xml files to be treated as a line.

  • Reading modes: 3 possible modes

    • PERMISSIVE: Scans all rows: NULL values are inserted in place of missing values and extra values are ignored.

    • DROPMALFORMED: Drops rows with fewer or more values than expected or values that don’t match the pattern.

    • FAILFAST: Aborts with a RuntimeException if a malformed row is encountered

  • Encoding: Indicates the computer character format (encoding) used by the file to be processed in order to take special characters into account. It is positioned by default in UTF-8. It is possible to change this value.

  • Ignore spaces before or after data: Indicates whether white spaces around read values should be ignored. The default is No

  • Treat empty values as null values: Indicates whether the space character should be treated as a null value. The default is No

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Method: Method to apply GET or POST

  • URI: Indicates the URL that will be consumed by the Http / Https connector

  • Header: Allows key-value pairs to be added to the request header

  • Body: For the POST method, allows you to specify the Body

XML File V2

The "XML File V2" reader differs from the "Xml" reader in that, like the "JSON V2 File" reader, it allows free selection of the elements from the file that the user wishes to set as headers for their repository.

The interface includes all configuration fields from the "Xml" reader.

Once the configuration is done, the Data schema detection button opens the modal window that lets the user finely select the elements to be used as repository headers.

Excel
  • The address of the data: Indicates the sheet and the cell range of the Excel file to be read. Example: My Sheet!A1:K225

  • Workbook password: If the Excel file is protected by a Password, it is mandatory to specify it in this input area.

  • Reading mask: Indicates the file reading mask. Only the files corresponding to the reading mask will be taken into account.

  • Method: Method to apply GET or POST

  • URI: Indicates the URL that will be consumed by the Http / Https connector

  • Header: Allows key-value pairs to be added to the request header

  • Body: For the POST method, allows you to specify the Body

Warning Headers of numeric-type columns (which may come from Excel formulas) are not accepted and generate an error during integration.

Advanced settings: Repository on HTTP/HTTPS connector (GET and POST)

HTTP(S) Parameter

  • 1 General settings (some settings depend on the expected return type: CSV, JSON, etc.)

  • 2 Allows key-value pairs to be added to the request header

  • 3 Iteration management

Iterations by Offset

HTTP(S) Parameter

  • 1 Variable automatically added to URI

  • 2 Waiting time between each iteration

  • 3 Read start line

  • 4 Number of line(s) read at each iteration

  • 5 Number of iteration(s)

  • 6 Content in body returned by 3rd party for read completion (with or without REGEX)

  • 7 Content in the header returned by the third party for the end of reading (with or without REGEX)

Iterations per Page

HTTP(S) Parameter

  • 1 Variable automatically added to URI

  • 2 Waiting time between each iteration

  • 3 Reading start page

  • 4 End of reading page

  • 5 Number of iteration(s)

  • 6 Content in body returned by 3rd party for read completion (with or without REGEX)

  • 7 Content in the header returned by the third party for the end of reading (with or without REGEX)

Example POST parameter

Example of configuration with the POST method: 4 iterations maximum with a skip of 50
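The offset-iteration logic of this example (at most 4 iterations, skip of 50) can be sketched as follows. The fetch function below is a stand-in for the real HTTP call, not the connector's actual implementation:

```python
DATA = list(range(120))  # stand-in for the remote resource

def fetch(offset, limit):
    # Simulates GET <uri>?offset=<offset>&limit=<limit> against a third party.
    return DATA[offset:offset + limit]

def read_by_offset(start=0, skip=50, max_iterations=4):
    rows, offset = [], start
    for _ in range(max_iterations):  # 5: number of iterations
        page = fetch(offset, skip)   # 4: lines read per iteration
        if not page:                 # 6/7: end-of-read signal
            break
        rows.extend(page)
        offset += skip               # 1: variable added to the URI
    return rows

print(len(read_by_offset()))  # 120 -- stops early when data runs out
```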

Connectors: Database type NoSql
MongoDB
  • Collection: Specify the MongoDB collection in this input box

NEO4J
  • Cypher request: Specify in this zone the request to be made. Click "Preview" to display the query result data.

Connectors: Sql Database
  • Script: Allows you to define the SQL script that will be used when retrieving data.

Save settings

  • Once the settings have been made, click on the Save button.

Warning After saving, in order to deduce the headers available in the read file, it is mandatory to perform the header synchronization action. Click on the Synchro button located in area 9 to perform this operation.

Warning The synchronization of the headers must be carried out after the creation of the repository.

Functions available for repository management

Description of the functions screen in a repository

1 Connector and reader definition area.

2 Reader settings area. These parameters vary according to the connector and reader used.

3 Business Entities linked tab: lists all the Business Entities that consume the repository.

4 Area containing the functions available for a repository.

5 Headers tab: contains the file reference headers.

6 Remote Files tab: allows you to view the file(s) present in the repository.

Remote File Actions
  • View: Click on the "Magnifying glass" icon located at the end of the line

  • Download: Click on the "Download" icon located at the end of the line

7 Extractions tab: allows you to perform time-stamped extractions of source values. The number of extractions is not limited.

Warning Note that the extractions performed can be consumed by the data blocks.

Warning The Extractions function can be used as a historization function.

8 Filters tab: allows you to generate filters (which will be applied at the level of the link between the Business Entity and the repository) on the extractions in order to be able to partially exploit them in data blocks. Example: The last 3 extractions, the extractions of the last 10 days,…​

9 Mandatory action: synchronize the headers.

Warning A repository can supply n Business Entities. The number of Business Entities that can be fed by the same repository is not limited.

Warning From the list of Business Entities, it is possible to create a Business Entity. In this case, the new Business Entity will be initialized with the headers of the repository.

File Explorer

The explorer is available for repositories linked to HDFS, MinIO, S3 and SFTP type connectors. The function makes it easy to view and define the path to the directory containing the remote files to be integrated.

To explore the remote files, click on the magnifying glass located on the "Path" line. Explorer

Editing a Repository

  • Access the GenericsData module

  • Left Menu Bar

  • Choose the Repositories menu option

  • Search in the repository list

    The lists of the elements of the DataChain offer have filter and search functions on columns. Use these functions to find the target repository.

    • Click on the label of the chosen repository or on the icon edit at the end of the line.

Deleting a Repository

  • Access the GenericsData module

  • Left Menu Bar

  • Choose the Repositories menu option boxes.svg

  • Search in the repository list

    The lists of the elements of the DataChain offer have filter and search functions on columns. Use these functions to find the target repository.

  • Option 1 Use the delete icon at the end of the line and confirm the action.

  • Option 2

    • Click on the label of the chosen repository or on the icon edit at the end of the line.

    • Once the edit repository page is displayed, click on the Delete button then confirm the action.

Quick Reference

Creation of a Repository

Access the GenericsData module.

Step | Objective | Action | Landmark
---- | --------- | ------ | --------
1 | Access the list of repositories | Click on the Repositories icon | boxes.svg
2 | Create a new repository | Click on the New icon | chart-area
3 | Metadata | Enter the information | Description required
4 | Choose a connector | Choose from the list of available connectors, or use the local repository |
5 | Save | Click on the Save button | save-light_button
6 | Define the parameters | Enter the settings information |
7 | Synchronize the data structure with the repository | Click on the Synchro button | Sync
8 | Save | Click on the Save button | save-light_button