ETL, ELT and Data Pipelines, Data Integration

ETL has historically been used for large-scale batch workloads, but it is increasingly used for real-time streaming as well. Popular ETL tools: IBM InfoSphere Information Server, AWS Glue, Improvado, Skyvia, Hevo, Informatica.

ETL (Extract, Transform, Load):

  1. Gathering raw data
  2. Extracting information needed for reporting and analysis
  3. Cleaning, standardizing, and transforming data into a usable format.
  4. Loading data into a data repository.
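
As a rough illustration of these four steps, here is a minimal Python sketch; the CSV file, column names, and the SQLite target are all hypothetical stand-ins for a real source and repository.

```python
import sqlite3

import pandas as pd

# 1. Gather raw data from a hypothetical CSV source
raw = pd.read_csv("sales_raw.csv")

# 2. Extract only the fields needed for reporting and analysis
subset = raw[["order_id", "order_date", "amount"]].copy()

# 3. Clean, standardize, and transform into a usable format
subset = subset.dropna(subset=["order_id", "amount"])
subset["order_date"] = pd.to_datetime(subset["order_date"])

# 4. Load the processed data into a data repository (a local SQLite table here)
with sqlite3.connect("warehouse.db") as conn:
    subset.to_sql("sales", conn, if_exists="append", index=False)
```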

Extraction can be through:

  1. Batch Processing: large chunks of data are moved from source to destination at scheduled intervals. Examples: Blendo, Stitch.
  2. Stream Processing: data is pulled in real time from the source, transformed in transit, and loaded into the data repository. Examples: Samza, Storm. (Both modes are contrasted in the sketch below.)
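
A small, self-contained Python sketch of the two extraction modes; the record structure, batch size, and timestamp field are invented for illustration.

```python
from datetime import datetime, timezone

# Batch processing: move large chunks of records at scheduled intervals
def extract_batch(records, batch_size=1000):
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

# Stream processing: pull records one at a time as they arrive,
# transforming them in transit before they are loaded
def extract_stream(source):
    for record in source:
        record["ingested_at"] = datetime.now(timezone.utc).isoformat()
        yield record

rows = [{"id": i} for i in range(2500)]
print(sum(1 for _ in extract_batch(rows)))   # 3 batches of up to 1000 rows each
print(next(extract_stream(iter(rows))))      # first record, stamped in transit
```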

Transformation:

  1. Standardizing data formats and units of measurement.
  2. Removing duplicate data
  3. Filtering out data that is not required
  4. Establishing key relationships across tables
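
A hedged pandas sketch of these four transformations; the orders and customers tables, their columns, and the unit conversion are made up for illustration.

```python
import pandas as pd

orders = pd.DataFrame({
    "order_id": [1, 2, 2, 3],
    "weight_lb": [2.0, 5.5, 5.5, 1.2],
    "customer_id": [10, 11, 11, 99],
    "status": ["shipped", "shipped", "shipped", "test"],
})
customers = pd.DataFrame({"customer_id": [10, 11], "region": ["EU", "US"]})

# 1. Standardize units of measurement (pounds -> kilograms)
orders["weight_kg"] = orders["weight_lb"] * 0.453592

# 2. Remove duplicate rows
orders = orders.drop_duplicates()

# 3. Filter out data that is not required (e.g. test orders)
orders = orders[orders["status"] != "test"]

# 4. Establish key relationships across tables (join orders to customers)
report = orders.merge(customers, on="customer_id", how="inner")
print(report)
```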

Loading:

  1. Loading is the transportation of processed data into a data repository.
  2. Initial Loading: Populating all of the data in the repository.
  3. Incremental Loading: Applying updates and modifications periodically.
  4. Full refresh: Erasing a data table and reloading fresh data.
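
The three loading strategies can be pictured with a minimal pandas/SQLite sketch; the database file, table, and rows are placeholders, and a real warehouse would normally use its own bulk-load tooling.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("repository.db")

# Initial loading: populate the (empty) table for the first time
initial = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})
initial.to_sql("facts", conn, if_exists="fail", index=False)   # raises if the table already exists

# Incremental loading: periodically append only new or changed rows
updates = pd.DataFrame({"id": [3], "value": ["c"]})
updates.to_sql("facts", conn, if_exists="append", index=False)

# Full refresh: erase the table and reload fresh data
latest = pd.DataFrame({"id": [1, 2, 3], "value": ["a", "b", "c"]})
latest.to_sql("facts", conn, if_exists="replace", index=False)

conn.close()
```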

Load Verification includes checks for:

  1. Missing or Null values
  2. Server performance
  3. Load Failures.
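
A minimal post-load verification sketch covering missing/null values and load failures; server performance would normally be checked through the repository's own monitoring rather than a script like this, and the table and column names are placeholders.

```python
import pandas as pd

def verify_load(conn, table, expected_rows, required_columns):
    """Run basic checks against a table that was just loaded."""
    loaded = pd.read_sql(f"SELECT * FROM {table}", conn)

    # Load failures: did every expected row actually arrive?
    if len(loaded) != expected_rows:
        raise ValueError(f"Load failure: expected {expected_rows} rows, got {len(loaded)}")

    # Missing or null values in columns that must always be populated
    nulls = loaded[required_columns].isnull().sum()
    if nulls.any():
        raise ValueError(f"Null values found:\n{nulls[nulls > 0]}")

    return True
```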

ELT (Extract, Load, Transform):

  1. Helps process large sets of unstructured and non-relational data
  2. Is ideal for data lakes

Advantages:

  1. Since raw data is delivered directly to the destination without a staging environment (as in ETL), the cycle between extraction and delivery is shortened.
  2. Allows you to ingest volumes of raw data immediately, as soon as the data becomes available.
  3. Affords greater flexibility to analysts and data scientists for exploratory data analytics.
  4. Transforms only the data that is required for a particular analysis, so the raw data can be leveraged for multiple use cases.
  5. More suited to working with big data.
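
A compact ELT sketch, with SQLite standing in for the target data lake or warehouse; the file, table, and column names are hypothetical. Raw data is loaded first, and the transformation runs inside the destination only when a particular analysis needs it.

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect("lake.db")

# Extract + Load: raw data goes straight into the target, with no staging transforms
raw = pd.read_csv("events_raw.csv")   # hypothetical source extract
raw.to_sql("events_raw", conn, if_exists="append", index=False)

# Transform (on demand, inside the target): only the data this analysis requires
conn.execute("""
    CREATE VIEW IF NOT EXISTS daily_signups AS
    SELECT date(event_time) AS day, COUNT(*) AS signups
    FROM events_raw
    WHERE event_type = 'signup'
    GROUP BY date(event_time)
""")
conn.close()
```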

Data Pipelines

  1. Encompasses the entire journey of moving data from one system to another, including the ETL process.
  2. Can be used for both batch and streaming data
  3. Supports both long-running batch queries and smaller interactive queries.
  4. Typically loads data into a data lake, but can also load data into a variety of target destinations - including other applications and visualization tools.
  5. Examples: Apache Beam, Airflow, and Dataflow (a minimal Airflow DAG is sketched below).
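
As an illustration of how such a pipeline can be defined, here is a minimal Apache Airflow 2.x-style DAG wiring extract, transform, and load tasks together; the task bodies, DAG name, and schedule are placeholders, and exact parameter names vary across Airflow versions.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pull raw data from the source")         # placeholder task body

def transform():
    print("clean and standardize the raw data")    # placeholder task body

def load():
    print("write the processed data to the lake")  # placeholder task body

with DAG(
    dag_id="example_daily_pipeline",   # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",        # batch schedule; streaming pipelines use other tools
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    extract_task >> transform_task >> load_task    # run order: extract -> transform -> load
```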

Data Integration

How does data integration relate to ETL and Data Pipelines?

While data integration combines disparate data into a unified view of the data, a data pipeline covers the entire data movement journey from source to destination systems. In that sense, you use a data pipeline to perform data integration, while ETL is a process within data integration. There is no one approach to data integration.

Capabilities of a modern data integration platform:

  1. Pre-built connectors and adapters.
  2. Open-source architecture
  3. Optimization for batch processing of large-scale data, continuous data streams, or both.
  4. Integration with big data sources.
  5. Additional functionalities for data quality, governance, compliance, and security.
  6. Portability between on-premise and different types of cloud environments.

Data Integration Tools:

IBM offers:

  1. IBM Information Server
  2. Cloud Pak for Data
  3. Cloud Pak for Integration
  4. Data Replication
  5. Data Virtualization
  6. InfoSphere Information Server on Cloud
  7. IBM InfoSphere DataStage

Talend offers:

  1. Talend Data Fabric
  2. Talend Cloud
  3. Talend Data Catalog
  4. Talend Data Management
  5. Talend Big Data
  6. Talend Data Services
  7. Talend Open Studio

Others:

  1. SAP
  2. Oracle
  3. Denodo
  4. SAS
  5. Microsoft
  6. Qlik
  7. TIBCO

Cloud-based Integration Platform as a Service (iPaaS) offerings:

  1. Adeptia Integration Suite
  2. Google Cloud Cooperation 534
  3. IBM Application Integration Suite on Cloud
  4. Informatica Integration Cloud

SUMMARY

A Data Repository is a general term that refers to data that has been collected, organized, and isolated so that it can be used for reporting, analytics, and also for archival purposes.

The different types of Data Repositories include:

  • Databases, which can be relational or non-relational, each following a set of organizational principles that govern the types of data they can store and the tools that can be used to query, organize, and retrieve data.
  • Data Warehouses, which consolidate incoming data into one comprehensive storehouse.
  • Data Marts, which are essentially sub-sections of a data warehouse, built to isolate data for a particular business function or use case.
  • Data Lakes, which serve as storage repositories for large amounts of structured, semi-structured, and unstructured data in their native format.
  • Big Data Stores, which provide distributed computational and storage infrastructure to store, scale, and process very large data sets.

The ETL, or Extract, Transform, and Load, process is an automated process that converts raw data into analysis-ready data by:

  • Extracting data from source locations.
  • Transforming raw data by cleaning, enriching, standardizing, and validating it.
  • Loading the processed data into a destination system or data repository.

The ELT, or Extract, Load, and Transform, process is a variation of the ETL process. In this process, extracted data is loaded into the target system before the transformations are applied. This process is ideal for Data Lakes and working with Big Data.

Data Pipeline, sometimes used interchangeably with ETL and ELT, encompasses the entire journey of moving data from its source to a destination data lake or application, using the ETL or ELT process.

Data Integration Platforms combine disparate sources of data, physically or logically, to provide a unified view of the data for analytics purposes.