File Component

The File Component in DataOps Suite is the initial step in data extraction after configuring the file-based data source. It is used to extract data from supported file formats such as CSV, XML, AVRO, JSON, PARQUET, and COBOL copybooks, and store the extracted data as a dataset for further use in workflows. Once the data is extracted through the File Component, users can proceed to perform various operations such as data processing, validation checks, and reconciliation.
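Conceptually, the extraction step turns a raw file into a row-based dataset that downstream components can consume. The following is an illustrative Python sketch of that idea, not the Suite's actual implementation; the sample content and function name are hypothetical.

```python
# Illustrative sketch (not the Suite's implementation): parse a delimited
# file's contents into an in-memory "dataset" of row dictionaries, the way
# the File component conceptually prepares data for later workflow steps.
import csv
import io

# Hypothetical sample content standing in for a configured CSV file source.
raw = "id,name,amount\n1,Alice,10.5\n2,Bob,7.25\n"

def extract_dataset(text):
    """Parse delimited text into a list of row dictionaries."""
    reader = csv.DictReader(io.StringIO(text))
    return list(reader)

dataset = extract_dataset(raw)
```

Once extracted in this row-based form, the data can be passed on to processing, validation, or reconciliation steps.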

Prerequisites

Getting Started

After establishing a File source connection in DataOps Suite, create a new Dataflow or open an existing one, add the File component, and then follow the steps below:

The "File" component wizard opens with the Properties page.

File Component Wizard Details

To create or edit a File Component, navigate through the File Component wizard: set up the Properties and File tabs, and then run the component.

Properties

This is the first step in configuring the File component. Here, you define the basic settings for how the component should behave.

A sample screenshot of the "File" component is shown below.


File

The "File" tab in the File Component is where users configure the details needed to read data from the source file.

Folder/File Details

The user must either select a file from storage (using the folder icon) or enter the file path manually. This is the source file to be read. 

For a better file browsing experience, a File Manager has been implemented to support sub-folders and fetch data files or folders from S3, Azure Data Lake Storage (ADLS), Shared, and Local file locations.

The user can upload any file from the local machine to the S3, ADLS, Local, or Shared file location using the Browse file button. Select the folder or file to fetch the data from, and then click the OK button.

A sample screenshot of the File Manager is shown below:


For the Files in the Shared Folder data source connection, it is possible to execute an entire folder. A folder can contain many files, so for reliable output all files should share the same structure. This functionality allows you to efficiently integrate data from multiple files within a single shared folder without needing to specify each file individually.

Example: Let us consider a shared-folder CSV data source connection (e.g., CSV_local) that contains a folder named "Automation_folder".


The folder “Automation_folder” is selected for the execution as shown below.
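The folder execution described above can be sketched as follows. This is an illustrative Python example, not the Suite's implementation; the folder and file names are hypothetical, and the key assumption is that every file in the folder shares the same column layout.

```python
# Illustrative sketch: executing a folder instead of a single file. All CSVs
# in the folder are assumed to have identical columns, so their rows can be
# appended into one combined dataset.
import csv
import tempfile
from pathlib import Path

# Hypothetical stand-in for the shared "Automation_folder".
folder = Path(tempfile.mkdtemp()) / "Automation_folder"
folder.mkdir()
(folder / "part1.csv").write_text("id,name\n1,Alice\n")
(folder / "part2.csv").write_text("id,name\n2,Bob\n")

def read_folder(path):
    """Read every CSV in the folder and append its rows into one dataset."""
    rows = []
    for file in sorted(path.glob("*.csv")):
        with file.open(newline="") as fh:
            rows.extend(csv.DictReader(fh))
    return rows

combined = read_folder(folder)
```

If the files do not share the same structure, the combined rows would have mismatched columns, which is why a consistent layout matters for folder execution.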


Handling Duplicate File Uploads

When you upload a file to the File Manager, the system checks if a file with the same name already exists in the selected data source location.

If a duplicate is detected, a warning message will appear:

"File Already Exists – Do you want to overwrite the existing file?"


You will then have two options:
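The duplicate check above can be sketched in Python as follows. This is an illustration only; the folder, file name, and function are hypothetical stand-ins for the File Manager's behavior.

```python
# Illustrative sketch of the duplicate-file check: before saving an upload
# into the target location, test whether a same-named file already exists
# and only replace it when overwrite is confirmed.
import tempfile
from pathlib import Path

# Hypothetical target location that already holds a file named data.csv.
target_dir = Path(tempfile.mkdtemp())
existing = target_dir / "data.csv"
existing.write_text("old contents")

def upload(source_text, destination, overwrite=False):
    """Write the upload unless a same-named file exists and overwrite is False."""
    if destination.exists() and not overwrite:
        return "kept existing file"
    destination.write_text(source_text)
    return "uploaded"

first = upload("new contents", existing, overwrite=False)   # declined
second = upload("new contents", existing, overwrite=True)   # confirmed
```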

File Options

The read and write options for the selected file source are displayed. By default, you will see the options used while creating the data source connection. If there is a need to customize the options, use the add, edit, or delete buttons to add new options or modify the existing ones.
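The effect of read options can be sketched as follows. This illustrative Python example treats the options as a plain dictionary; the option names shown (delimiter, header) are hypothetical stand-ins for whatever the data source connection actually defines.

```python
# Illustrative sketch: read options such as the delimiter and header handling
# change how the same raw file content is parsed into rows.
import csv
import io

# Hypothetical options, mimicking those set on a data source connection.
options = {"delimiter": "|", "header": True}
raw = "id|name\n1|Alice\n"

def read_with_options(text, opts):
    """Parse delimited text using the supplied read options."""
    reader = csv.reader(io.StringIO(text), delimiter=opts["delimiter"])
    rows = list(reader)
    if opts["header"]:
        header, rows = rows[0], rows[1:]
        return [dict(zip(header, row)) for row in rows]
    return rows

rows = read_with_options(raw, options)
```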

Encode

Displays the encoding format if the file source connection was created with a specific character encoding.
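Why the encoding matters can be shown with a short, illustrative Python example: the same bytes only decode into the intended characters when read with the encoding the file was written in. The file name and the latin-1 encoding here are hypothetical.

```python
# Illustrative sketch: a file written in one encoding must be read back with
# that same encoding, or non-ASCII characters will be garbled or rejected.
import tempfile
from pathlib import Path

path = Path(tempfile.mkdtemp()) / "latin.csv"
path.write_bytes("id,name\n1,Müller\n".encode("latin-1"))

# Reading with the matching encoding recovers the original text.
decoded = path.read_text(encoding="latin-1")
```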

Result

On the Result page, you can execute the File component and preview the results.

Run a Component 

To run a component, perform any one of the following operations:


When the component runs successfully, the status will be displayed as "Completed".

To preview the results, navigate to the following tabs:

Preview

This tab displays the data of the file. The user can customize the number of rows to display in the output. The default count is 50.

Schema

This tab displays the columns and their data types used in the query. The Download Schema icon allows users to download the schema information of the queried data in CSV format. This is particularly useful for understanding the structure and data types of the columns in the result set, and the downloaded schema can then be used in any further analysis that requires knowledge of the data structure. The user can also filter the columns and their data types for a quick search.
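The kind of output the schema download produces can be sketched as follows. This is an illustrative Python example; the exact columns of the real CSV export are not documented here, so the "column,data_type" layout is an assumption.

```python
# Illustrative sketch: emit a dataset's column names and inferred types as
# CSV text, similar in spirit to what a schema download provides.
import csv
import io

# Hypothetical dataset row used to infer column types.
dataset = [{"id": 1, "name": "Alice", "amount": 10.5}]

def schema_csv(rows):
    """Write 'column,data_type' lines inferred from the first row."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["column", "data_type"])
    for col, value in rows[0].items():
        writer.writerow([col, type(value).__name__])
    return out.getvalue()

schema = schema_csv(dataset)
```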

Statements

This tab displays the SQL and Spark query statements. While the component is running, the statements show the status "Queued". Once a statement runs successfully, its status changes to "OK".


A sample screenshot of the output of the file component is shown below.



See Also

Dataflow actions are the fundamental operations the user can perform on dataflows, such as adding components, viewing datasets, switching to diagram view, downloading the full dataset, and accessing more options.

For more information, see the Dataflow Actions topic.

A sample screenshot of the various "Dataflow Actions" (highlighted) is shown below.


© Datagaps. All rights reserved.