Data Warehouses
A data warehouse is a central repository of data integrated from multiple sources. It serves as a single source of truth, holding data that has been cleansed, conformed, and categorized.
- Data loaded into the warehouse is already modeled and structured for a specific purpose, meaning it is analysis-ready.
- Traditionally stores relational data from transactional systems and operational databases such as CRM, ERP, HR, and finance applications.
- Nonrelational databases are also being used for data warehousing.
The data warehouse has a three-tier architecture:
- Top Tier - Client Front-end Layer (querying, reporting, and analyzing data)
- Middle Tier - OLAP Server (processes and analyzes information coming from the database servers)
- Bottom Tier - Database Servers (extract data from different sources)
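The three tiers above can be sketched in a few lines. This is only an illustration, assuming an in-memory SQLite database stands in for the database servers; the table name `sales`, the function names, and the sample rows are all hypothetical.

```python
import sqlite3


def build_bottom_tier(conn, crm_rows, erp_rows):
    """Bottom tier: store rows extracted from different sources in one table."""
    conn.execute("CREATE TABLE sales (source TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES ('crm', ?, ?)", crm_rows)
    conn.executemany("INSERT INTO sales VALUES ('erp', ?, ?)", erp_rows)


def middle_tier_total_by_region(conn):
    """Middle tier: an OLAP-style aggregation over the stored data."""
    cur = conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    )
    return dict(cur.fetchall())


def top_tier_report(totals):
    """Top tier: format the aggregated result for the person reading it."""
    return [f"{region}: {total:.2f}" for region, total in sorted(totals.items())]


# Hypothetical sample data from two source systems.
conn = sqlite3.connect(":memory:")
build_bottom_tier(conn, [("EU", 100.0), ("US", 250.0)], [("EU", 50.0)])
totals = middle_tier_total_by_region(conn)
report = top_tier_report(totals)
print("\n".join(report))
```

In a real warehouse each tier is a separate system (database servers, an OLAP engine, and a reporting tool); here they are plain functions so the division of responsibility is easy to see.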
Note:
Data warehouses that once resided in on-premises data centers are moving to the cloud. The benefits of moving to the cloud are:
- Lower costs
- Virtually limitless storage and compute capabilities
- Scale on a pay-as-you-go basis
- Faster disaster recovery.
Popular data warehouses:
- Teradata
- Oracle Exadata
- IBM DB2
- Netezza
- Amazon Redshift
- Google BigQuery
- Cloudera
- Snowflake
Data Marts
A data mart is a sub-section of the data warehouse, built specifically for a particular business function, purpose, or community of users.
Purpose of Data Marts:
- Provide users with the data most relevant to them, when they need it
- Accelerate business processes
- Provide a cost- and time-efficient way to make data-driven decisions
- Improve end-user response time
- Provide secure access and control
Types of data marts:
- Dependent
- Independent
- Hybrid
Dependent Data Marts:
- Subsection of an enterprise data warehouse
- Provide isolated security and isolated performance
- Offer analytical capabilities for a restricted area of the warehouse
- Pull data from an enterprise data warehouse, where data is already cleaned and transformed
Independent Data Marts:
- Created from sources other than an enterprise data warehouse, such as internal operational systems or external data
- Need to carry out the transformation process on the source data, since it comes directly from operational systems and external sources
Hybrid Data Marts:
- Combine inputs from data warehouses, operational systems, and external systems
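The difference between the three types comes down to where the data is pulled from and whether it still needs transforming. A minimal sketch, assuming an in-memory SQLite table plays the role of the enterprise warehouse; the names `warehouse_sales`, `finance_mart`, and `clean_operational_row` are hypothetical, as is the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical enterprise warehouse table holding already-cleaned data.
conn.execute(
    "CREATE TABLE warehouse_sales (department TEXT, region TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO warehouse_sales VALUES (?, ?, ?)",
    [("finance", "EU", 120.0), ("hr", "EU", 30.0), ("finance", "US", 80.0)],
)

# Dependent data mart: a restricted slice of the warehouse, no extra cleaning.
conn.execute(
    "CREATE VIEW finance_mart AS "
    "SELECT region, amount FROM warehouse_sales WHERE department = 'finance'"
)
dependent_rows = conn.execute(
    "SELECT region, amount FROM finance_mart ORDER BY region"
).fetchall()


def clean_operational_row(raw):
    """Independent data mart: raw operational records must be transformed first."""
    region, amount = raw
    return region.strip().upper(), float(amount)


# Hypothetical raw records coming straight from an operational system.
raw_operational = [(" eu ", "45.5"), ("us", "10")]
independent_rows = [clean_operational_row(r) for r in raw_operational]

# Hybrid data mart: combines warehouse data with transformed operational data.
hybrid_rows = dependent_rows + independent_rows
```

A view is used for the dependent mart only because it is the shortest way to show a subsection of the warehouse; a real deployment might instead use a separate schema or database.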
Data Lakes
A data lake stores large amounts of structured, semi-structured, and unstructured data in its native format.
- Data can be loaded without first defining its structure and schema.
- Exists as a repository of raw data straight from the source, to be transformed based on the use case.
- Data is classified, protected, and governed.
- Combines a variety of technologies to enable agile data exploration for analysts and data scientists.
- Can be deployed using cloud object storage such as Amazon S3, large-scale distributed systems such as Apache Hadoop, or relational (RDBMS) and NoSQL data repositories.
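Loading data without a predefined schema, and transforming it only when a use case needs it, is often called schema-on-read. A minimal sketch, assuming a plain dictionary stands in for the object store; the keys, file contents, and function names are hypothetical.

```python
import csv
import io
import json

# Hypothetical raw objects, stored exactly as they arrived from each source.
lake = {
    "orders/2024.csv": "id,amount\n1,10.5\n2,4.5\n",
    "events/clicks.json": '[{"user": "a", "page": "/home"}, {"user": "b", "page": "/cart"}]',
    "notes/support.txt": "Customer reported a late delivery.",
}


def read_orders(lake):
    """Apply a schema at read time: parse the CSV and cast each column."""
    reader = csv.DictReader(io.StringIO(lake["orders/2024.csv"]))
    return [{"id": int(r["id"]), "amount": float(r["amount"])} for r in reader]


def read_click_pages(lake):
    """A different use case reads a different object with its own structure."""
    return [event["page"] for event in json.loads(lake["events/clicks.json"])]


order_total = sum(order["amount"] for order in read_orders(lake))
pages = read_click_pages(lake)
```

Nothing was validated or reshaped when the objects were stored; each reader decides how to interpret the raw data, which is what allows the same data to be repurposed for several use cases.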
Benefits:
- Ability to store all types of data (structured, semi-structured, and unstructured)
- Agility to scale storage capacity as data volumes grow
- Time saved by not having to define structures, schemas, and transformations before loading
- Ability to repurpose data in several different ways and wide-ranging use cases.
Vendors providing architectures and platforms for data lakes include Amazon, Cloudera, Google, IBM, Informatica, Microsoft, Oracle Exadata, SAS, Snowflake, Teradata, and Zaloni.