Data Engineering Lifecycle

Architecting the Data Platform

Layers of a data platform architecture: a layer represents the functional components that perform a specific set of tasks in the data platform.

  1. Data Ingestion or Data Collection Layer
  2. Data Storage and Integration Layer
  3. Data Processing Layer
  4. Analysis and User Interface Layer
  5. Data Pipeline Layer

Data Ingestion or Data Collection Layer:

  1. Connect to data stores
  2. Transfer data from data stores to the data platforms in streaming and batch modes.
  3. Maintain information about the data collected in the metadata repository.
  4. Tools: Google DataFlow, IBM Streams, IBM Streaming Analytics on the cloud, Amazon Kinesis, and Kafka (a minimal ingestion sketch follows the storage list below).

Data Storage and Integration Layer:

  1. Store data for processing and long-term use
  2. Transform and merge extracted data, either logically or physically
  3. Make data available for processing in both batch and stream modes.
  4. Tools: IBM DB2, Microsoft SQL Server, MySQL, Oracle Database, PostgreSQL
  5. A cloud database is referred to as Database as a Service (DBaaS); tools in this category include IBM DB2, Amazon RDS, Google Cloud SQL, and SQL Azure. Non-relational databases on the cloud include Cassandra, Neo4j, IBM Cloudant, Redis, and MongoDB.
  6. Tools for integration include IBM Cloud Pak for Data, IBM Cloud Pak for Integration, Talend Data Fabric, and Talend Open Studio.
  7. Other widely used integration platforms include Boomi and SnapLogic.
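
As a hedged sketch of streaming ingestion into the platform, the snippet below publishes one source record to a Kafka topic with the kafka-python client; the broker address, topic name, and record fields are hypothetical placeholders, not part of the course material.

```python
import json

from kafka import KafkaProducer  # kafka-python client

# Broker address and topic name are hypothetical placeholders.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda record: json.dumps(record).encode("utf-8"),
)

# One record pulled from a source data store (field names are made up for illustration).
event = {"order_id": 1001, "amount": 25.50, "ts": "2024-01-01T12:00:00Z"}

# Streaming mode: each new record is published as soon as it arrives;
# a batch-mode job would instead send accumulated records on a schedule.
producer.send("orders-raw", value=event)
producer.flush()
```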

Data Processing Layer:

  1. Read data from storage in batch or streaming modes and apply transformations
  2. Support popular querying tools and programming languages
  3. Scale to meet the processing demands of a growing dataset
  4. Provide a way for analysts and data scientists to work with data in the data platform.

Transformation Tasks:

  1. Structuring: Actions that change the form and schema of the data.
  2. Normalization: Cleaning the database of unused data and reducing redundancy and inconsistency.
  3. Denormalization: Combining data from multiple tables into a single table that can be queried more efficiently.
  4. Data Cleaning: Fixing irregularities in data to provide credible data for downstream applications and uses.
  5. Storage and processing may not always be performed in separate layers.
  6. In an RDBMS, storage and processing can occur in the same layer.
  7. In Big Data systems, data can first be stored in HDFS and then processed in a data processing engine such as Apache Spark.
  8. The data processing layer can also precede the data storage layer, where transformations are applied before the data is loaded or stored in the database.

There is a host of tools available for performing these transformations on data; they are selected based on the data size, data structure, and the specific capabilities of the tool. Some of these tools are: spreadsheets, OpenRefine, Google DataPrep, Watson Studio Refinery, and Trifacta Wrangler.
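
A minimal sketch of the data-cleaning task described above, using pandas; the table, columns, and values are invented for illustration and are not from the course.

```python
import pandas as pd

# Raw extract with typical irregularities (values are invented for illustration).
raw = pd.DataFrame({
    "customer_id": [1, 2, 2, 3],
    "signup_date": ["2024-01-05", "2024-02-10", "2024-02-10", None],
    "country": ["US", "us", "us", "DE"],
})

cleaned = (
    raw.drop_duplicates()                                       # remove duplicate rows
       .assign(country=lambda df: df["country"].str.upper())    # standardize casing
       .dropna(subset=["signup_date"])                          # drop rows missing a required field
)

cleaned["signup_date"] = pd.to_datetime(cleaned["signup_date"])  # enforce a proper date type
```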

Analysis and User Interface Layer:

  1. Delivers processed data to data consumers such as BI Analysts, Business Stakeholders, Data Scientists and Analysts, and other applications and services.
  2. Querying tools and programming languages: SQL for relational databases, SQL-like query tools for NoSQL databases, and programming languages such as Python, R, and Java.
  3. APIs that can be used to run reports on data for both online and offline processing.
  4. APIs that can consume data from the storage in real-time for use in other applications and services.
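
As a hedged sketch of the querying side of this layer, the snippet below runs SQL from Python and hands the result to a consumer as a DataFrame; an in-memory sqlite3 database stands in for whichever data store the platform actually uses, and the table and columns are made up.

```python
import sqlite3

import pandas as pd

# sqlite3 in memory stands in for the platform's real storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (customer_id INTEGER, country TEXT)")
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "US"), (2, "US"), (3, "DE")])

# A BI analyst or downstream service issues SQL against the processed data
# and receives a result set it can chart, report on, or serve through an API.
report = pd.read_sql_query(
    "SELECT country, COUNT(*) AS customers FROM customers GROUP BY country",
    conn,
)
print(report)
```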

Data Pipeline Layer:

  1. Overlaying the Data Ingestion, Data Storage and Integration, and Data Processing layers is the Data Pipeline Layer, with Extract, Transform, and Load (ETL) tools.
  2. This layer is responsible for implementing and maintaining a continuously flowing data pipeline.
  3. Data pipeline solutions: Apache Airflow and DataFlow.
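
The sketch below shows what a simple ETL pipeline definition can look like in Apache Airflow, assuming a recent Airflow 2.x installation; the DAG id, schedule, and task bodies are hypothetical placeholders rather than anything prescribed by the course.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract():
    # Pull raw records from a source system (placeholder logic).
    return [{"id": 1, "value": 42}]


def transform():
    # Apply cleaning and business rules (placeholder logic).
    pass


def load():
    # Write transformed records to the storage layer (placeholder logic).
    pass


with DAG(
    dag_id="example_etl_pipeline",   # hypothetical name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # run once per day (Airflow 2.4+ argument)
    catchup=False,
) as dag:
    extract_task = PythonOperator(task_id="extract", python_callable=extract)
    transform_task = PythonOperator(task_id="transform", python_callable=transform)
    load_task = PythonOperator(task_id="load", python_callable=load)

    # Extract -> Transform -> Load, executed in order on every run.
    extract_task >> transform_task >> load_task
```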

Factors for Selecting and Designing Data Stores

A data store, or repository, is a general term used to refer to data that has been collected, organized, and isolated so that it can be used for business operations or mined for reporting and data analysis.

A repository can be:

  1. Database
  2. Data warehouse
  3. Data Mart
  4. Data Lake
  5. Big Data Store

Primary considerations for designing a data store:

  1. Type of data
  2. Volume of data
  3. Intended use of data
  4. Storage Considerations
  5. Privacy, security, and governance needs

Type of Data:

  1. There are multiple types of databases, and selecting the right one is a crucial part of designing a data store. Databases support the input, storage, search and retrieval, and modification of data.
  2. The main types are relational and non-relational.
  3. Relational databases (RDBMS) are best for structured data, which has a well-defined schema and can be organized in a tabular format.
  4. Non-relational (NoSQL) databases are best for semi-structured and unstructured data, which is schema-less and free-form. There are four types of NoSQL databases: key-value, document-based, column-based, and graph-based.
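
To contrast the two, here is a small sketch that stores the same customer once in a relational table with a fixed schema and once as a schema-less document; sqlite3 and a plain JSON string stand in for a real RDBMS and document store, and the fields are invented.

```python
import json
import sqlite3

# Relational: the schema is defined up front and every row must conform to it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, country TEXT)")
conn.execute("INSERT INTO customers VALUES (?, ?, ?)", (1, "Jane", "US"))

# Non-relational (document style): each record carries its own free-form structure,
# so new fields can appear without changing any schema.
document = {"id": 1, "name": "Jane", "preferences": {"newsletter": True}, "tags": ["vip"]}
doc_store = {"customer:1": json.dumps(document)}  # key-value / document store stand-in
```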

Data Lake:

  1. Store large volumes of raw data in its native format, straight from its source
  2. Store both relational and non-relational data at scale without defining the data's structure and schema.
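
A minimal sketch of landing raw data in its native format in a data lake built on object storage, using boto3; the local file name, bucket name, and key layout are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# The raw file is kept as-is (native format); no schema is imposed at write time.
s3.upload_file(
    Filename="events_2024-01-01.json",           # local raw extract (hypothetical)
    Bucket="example-data-lake",                   # hypothetical bucket
    Key="raw/source_a/2024/01/01/events.json",    # zone/source/date partitioned key
)
```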

Big Data Store:

  1. Store data that is high-volume, high-velocity, and of diverse types, and that needs distributed processing for fast analytics.
  2. Big Data Stores split large files across multiple computers, allowing parallel access to them.
  3. Computations run in parallel on each node where the data is stored (see the PySpark sketch after the list below).

Intended Use of Data:

How you intend to use the data you are collecting drives the design. Considerations include:

  1. Number of transactions
  2. Frequency of updates
  3. Types of operations
  4. Response time
  5. Backup and recovery
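
A minimal sketch of distributed processing over a big data store with PySpark; the HDFS path and column name are hypothetical. The read and the aggregation are executed in parallel across the nodes that hold the data, and the partial results are then combined.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("big-data-store-sketch").getOrCreate()

# Files split across the cluster are read in parallel (path is hypothetical).
events = spark.read.parquet("hdfs:///data/events/")

# The aggregation runs on each node where the data lives, then results are merged.
daily_counts = events.groupBy("event_date").count()
daily_counts.show()
```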

The intended use of data also drives scalability as a design consideration. The scalability of a data store is its capability to handle growth in the amount of data, workloads, and users. Normalization is another important consideration at the design stage; it helps with:

  1. Optimal use of storage space
  2. Makes database maintenance easier
  3. Provides faster access to data.
  4. Normalization is important for systems that handle transactional data, but for systems designed for handling analytical queries, normalization can lead to performance issues.
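
A small sketch of that trade-off using Python's built-in sqlite3 (the tables and columns are invented): the normalized design avoids redundancy but needs a join at query time, while the denormalized design repeats customer attributes on every order row so analytical reads can skip the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Normalized: customer attributes live in one place; orders reference them by key.
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, country TEXT);
CREATE TABLE orders    (order_id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
""")

# Transactional reads and writes stay small; analytical queries pay for the join.
normalized_report = conn.execute("""
    SELECT c.country, SUM(o.amount)
    FROM orders o JOIN customers c ON o.customer_id = c.customer_id
    GROUP BY c.country;
""").fetchall()

# Denormalized: country is repeated on every order row, so the same report needs no join.
conn.execute("CREATE TABLE orders_wide (order_id INTEGER, country TEXT, amount REAL)")
denormalized_report = conn.execute(
    "SELECT country, SUM(amount) FROM orders_wide GROUP BY country;"
).fetchall()
```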

Storage Considerations: Design considerations from the perspective of storage include the performance, availability, integrity, and recoverability of data.

Privacy, Security, and Governance: A secure data strategy is a layered approach; it includes access control, multizone encryption, data management, and monitoring systems.

  1. Regulations such as GDPR, CCPA, and HIPAA restrict the ownership, use, and management of personal and sensitive data. Data needs to be made available through controlled data flows and data management, using multiple data protection techniques.
  2. This is an important part of a data store's design. Strategies for data privacy, security, and compliance with government regulations need to be an integral part of a data store's design from the start; addressing them at a later stage results in patchwork solutions.
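
As a small illustration of one such protection technique (encrypting a sensitive field before it reaches the storage layer), the sketch below uses the third-party cryptography package; the record and its fields are hypothetical.

```python
from cryptography.fernet import Fernet

# In practice the key would come from a managed key store, not be generated inline.
key = Fernet.generate_key()
cipher = Fernet(key)

record = {"customer_id": 1042, "email": "jane@example.com"}  # hypothetical sensitive field

# Encrypt the sensitive field before the record is written to storage.
record["email"] = cipher.encrypt(record["email"].encode()).decode()

# Decrypt only for consumers that are authorized to see the raw value.
original_email = cipher.decrypt(record["email"].encode()).decode()
```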

SUMMARY

The architecture of a data platform can be seen as a set of layers, or functional components, each one performing a set of specific tasks. These layers include:

  • Data Ingestion or Data Collection Layer, responsible for bringing data from source systems into the data platform.
  • Data Storage and Integration Layer, responsible for storing and merging extracted data.
  • Data Processing Layer, responsible for validating, transforming, and applying business rules to data.
  • Analysis and User Interface Layer, responsible for delivering processed data to data consumers.
  • Data Pipeline Layer, responsible for implementing and maintaining a continuously flowing data pipeline.

A well-designed data repository is essential for building a system that is scalable and capable of performing well under high workloads.

The choice or design of a data store is influenced by the type and volume of data that needs to be stored, the intended use of data, and storage considerations. The privacy, security, and governance needs of your organization also influence this choice.

The CIA triad, or Confidentiality, Integrity, and Availability, describes three key components of an effective strategy for information security. The CIA triad is applicable to all facets of security, be it infrastructure, network, application, or data security.