Data Warehouses, Data Marts, and Data Lakes

Data Warehouses, Data Marts, and Data Lakes

Data Warehouses

A data warehouse is a central repository of data integrated from multiple sources. It serves as a single source of truth that has data that is cleansed, conformed, and categorized.

  1. Data loaded in the warehouse is already modeled and structured for a specific purpose, meaning analysis- ready.
  2. Known to store relational data from transactional systems and operational databases such as CRM, ERP, HR, and Finance Applications.
  3. Nonrelational databases are also being used for data warehousing. The Data warehouse has a 3-tire architecture :
  4. Top Tier - Client Front-end Layer ( Querying, reporting, and analyzing data)
  5. Middle Tier - (OLAP Server - Process and analyze information coming from database servers)
  6. Bottom Tier - Database Servers ( extracting data from different sources)

Note:

Data warehouses that once resided in on-premise data centers are moving to the cloud, and the benefits of moving to the cloud are :

  1. Lower costs
  2. Limitless storage and compute capabilities
  3. Scale on a pay-as-you-go basis
  4. Faster disaster recovery.

Popular data warehouses:

  1. Teradata
  2. oracle Exadata
  3. IBM DB2
  4. Netezza
  5. Amazon Redshift
  6. Google BigQuery
  7. Cloudera
  8. Snowflake

Data Marts

A data mart is a sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users.

Purpose of Data Marts:

  1. Provide data to users that are most relevant to them when they need it
  2. Accelerate business processes
  3. Provide a cost and time-effective way in which data-driven decisions can be taken
  4. Improve end-user response time
  5. Provide secure access and control. Type of data marts :
  6. Dependent
  7. Independent
  8. Hybrid

Dependent Data marts:

  1. Subsection of an enterprise data warehouse
  2. Provides isolated security and isolated performance
  3. Offers analytical capabilities for a restricted area of the warehouse
  4. Created from sources other than an enterprise data warehouse such as internal operational systems or external data.
  5. Pull data from an enterprise data warehouse, where data is already cleaned and transformed.

Independent Data Marts:

  1. Need to carry out the transformation process on the source data since it's coming directly from operational systems and external sources. Hybrid Data Marts:
  2. combine inputs from data warehouses, operational systems, and external systems.

Data Lakes

Store large amounts of structured, Semi-structured, and unstructured data in their native format.

  1. Data can be loaded without defining the structure and schema data.
  2. Exists as a repository of raw data straight from the source, to be transformed based on the use case.
  3. Data is classified, protected, and governed.
  4. Combine a variety of technologies that come together to facilitate agile data for analysts and data scientists.
  5. Can be deployed using cloud object storage such as Amazon S3 or Large scale distributed systems such as Apache Hadoop, Also RDBMS, and NoSQL Data Repositories.

Benefits:

  1. Ability to store all types of data ( Structured, Semi- structured, and unstructured)
  2. Agility to scale based on storage capacity
  3. Saving time in defining structures, schemas, and transformations.
  4. Ability to repurpose data in several different ways and wide-ranging use cases.
  5. Vendors providing architectures, and platforms for data lakes are: Amazon, Cloudera, Google, IBM, Informatica, Microsoft, Oracle Exadata, SAS, Snowflake, Teradata, and Zaloni.