A dataflow consists of components that retrieve data from data sources, apply changes and transformations, and load the results into an output file or table.

A dataflow runs on a Livy session, which executes the dataflow's components to perform the required changes. A component that executes successfully is marked as verified. The data generated by a component is stored in the repository as a dataframe, and its result is called a dataset. Whichever data source you work with, a component's result is always stored in the form of a dataset. A dataframe holds the result only while the Livy session is active; if you close, end, or delete the session, the dataframe is emptied.
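The session-scoped lifetime of these dataframes can be sketched in Python. This is an illustrative model only, not product code: the class and method names are hypothetical stand-ins for behavior the product manages internally.

```python
class LivySessionSketch:
    """Hypothetical model of how component results live inside a Livy session."""

    def __init__(self):
        self.active = True
        self.dataframes = {}  # component name -> cached result (the "dataset")

    def run_component(self, name, rows):
        if not self.active:
            raise RuntimeError("session is closed; its dataframes were emptied")
        self.dataframes[name] = rows  # held only while the session is active
        return rows

    def close(self):
        # Closing, ending, or deleting the session empties the dataframes.
        self.active = False
        self.dataframes.clear()


session = LivySessionSketch()
source_dataset = session.run_component("source", [{"id": 1}, {"id": 2}])
```

Once `session.close()` is called, the cached dataframes are gone and the components would need to run again in a new session.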

After the initial input (Source) component is created, most components work in the same way: they take the dataset produced by the Source component, or the dataset produced by another component, as their input. In other words, each component takes input from another and produces output for the next. For data to flow between components, a dependency dataset must always be selected in the component. This ensures that the dataset is available to the dependent component and that the latest metadata is retrieved from the source.
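The dependency-dataset chaining described above can be sketched as follows. The `run` helper and the component names are hypothetical; they only illustrate how each component's result becomes a dataset that the next component selects as its dependency.

```python
# Illustrative sketch (not product code) of dependency datasets:
# each component names the upstream dataset it depends on, and its own
# result becomes a dataset available to downstream components.
datasets = {}

def run(name, transform, depends_on=None):
    # The dependency dataset must already exist for data to flow.
    upstream = datasets[depends_on] if depends_on else None
    datasets[name] = transform(upstream)
    return datasets[name]

run("source", lambda _: [1, 2, 3])  # Source creates the first dataset
run("processor", lambda rows: [r * 10 for r in rows], depends_on="source")
run("target", lambda rows: list(rows), depends_on="processor")
```

If `depends_on` named a dataset that had not been produced yet, the lookup would fail, which mirrors why a dependency dataset must be selected before a component can run.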

The following is the high-level process for creating a Dataflow.

  1. Add a Dataflow.
  2. Add a JDBC (or File) component as input to the Dataflow.
  3. Run the JDBC (or File) component to create a dataset.
  4. Use the dataset created in step 3 as input to components such as Processor and Data Quality. Always select the dataset that contains the latest changes; the output of one component becomes the input to the next.
  5. Add a Target component to load your data.
  6. Run each component independently. Components that run successfully are marked as verified.
  7. Run the Dataflow.
  8. Go to Run History and monitor the run details.
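The steps above can be sketched as a small orchestration loop. All names here are hypothetical: the sketch only illustrates that components run in order, successful runs are marked verified, and each dataflow run leaves a record that can be inspected afterwards, as in Run History.

```python
# Illustrative sketch (not product code) of the run flow: components run
# in sequence, each successful run is marked verified, and every dataflow
# run appends an entry to a run history for later monitoring.
run_history = []

def run_component(name, fn, data):
    result = fn(data)  # a completed run produces this component's dataset
    return {"name": name, "verified": True, "result": result}

def run_dataflow(components):
    records = []
    data = None
    for name, fn in components:
        record = run_component(name, fn, data)
        data = record["result"]  # output of one component feeds the next
        records.append(record)
    status = "success" if all(r["verified"] for r in records) else "failed"
    run_history.append({"status": status,
                        "components": [r["name"] for r in records]})
    return data

output = run_dataflow([
    ("jdbc_source", lambda _: [3, 1, 2]),
    ("processor", lambda rows: sorted(rows)),
    ("target", lambda rows: list(rows)),
])
```

After the run, inspecting `run_history` plays the role of step 8: it shows which components ran and whether the dataflow succeeded.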

© Datagaps. All rights reserved.
Send feedback on this topic to Datagaps Support