Data Ecosystem, Repositories, Pipelines and Integration - Part 2

Data Ecosystem, Repositories, Pipelines and Integration - Part 2

Relational Databases

  1. Collection of data organized into a table structure where the tables can be linked, or related, based on data common to each.
  2. Build on the organizational principles of flat files such as spreadsheets, with data organized into rows and columns with a well-defined schema.
  3. Ideal for optimized storage, retrieval, and processing of data for large volumes of data
  4. Retrieve millions of records in seconds using SQL for querying data.

Relational databases can be:

  1. Open source with internal support
  2. Open source with commercial support
  3. Commercial cloud source examples: IBM DB2, Microsoft SQL Server, MySql, Oracle DB, PostgreSQL.

Cloud-based relational databases or database as a service Examples: Amazon RDS, Google SQL, IBM DB2 on the cloud, Oracle Cloud, and Azure SQL.

Advantages of Relational Databases

Relational databases minimize data redundancy. For example, the information of a customer appears in a single entry in the customer table, and the transaction table pertaining to the customer stores a link to the customer table. Ease of backup and disaster recovery: Relational databases offer easy to export and import options, making backup and restore easy.

Exports can happen while the database is running, making restoring on failure easily. Cloud-based relational databases do continuous mirroring, which means the loss of data on restore can be measured in seconds or less.

ACID-compliance: ACID stands for Atomicity, Consistency, Isolation, and Durability and ACID compliance implies that the data in the database remains accurate and consistent despite failures, and database transactions are processed reliably.

A relational Database is well suited for:

OLTP

  1. Accommodate a large number of users
  2. Manage small amounts of data
  3. Support frequent queries and fast response times Data Warehouses
  4. Can be optimized for online analytical processing ( OLAP) IoT Solutions
  5. Provide the speed and ability to collect and process data from edge services.

Limitations of RDBMS:

  1. Does not work well with semi - Structured and unstructured data
  2. Migration between two RDBMS is possible only when the source and destination tables have identical schemas and data types.
  3. Entering a value greater than the defined length of a data field results in loss of information.

NoSQL Database NoSQL( Not only SQL) is a non-relational DB that provides flexible schemas for storage and retrieval

  1. Built for specific data models
  2. Has flexible schemas that allow programmers to create and manage modern applications.
  3. Do not use a traditional row/column/table DB design with fixed schemas.
  4. Do not use structured query language to query data. Some may support SQL or Sql-like interfaces.

Allows data to be stored in schema-less or free form. 4 common types of No SQL DB : Key Value Stores: ex ( Redis, Memcached, DynamoDB)

  1. Keys and values can be anything from integers, and strings to complex JSON Documents.
  2. Storing user session data, user preferences, real-time recommendations, targeted advertising, and in-memory data caching. NOT A GREAT FIT FOR :
  3. Query data on specific data value
  4. Need relationships between data values
  5. Need multiple unique keys

Document Based: ex( MongoDB, DocumentDB, CouchDB, Cloudant)

  1. Store each record and its associated within a single document.
  2. Enable flexible indexing, powerful ad hoc queries and analytics over collection of documents.
  3. Preferred for eCommerce platforms, medial records storage, CRM platforms, and analytics platforms. NOT A GREAT FIT FOR :
  4. Run complex search queries
  5. Perform Multi-operation transactions

Column Based: ex( Cassandra, Apache HBase)

  1. Data stored in cells grouped as columns of data instead of rows.
  2. Logical grouping of columns is referred to as a column family.
  3. example: For example, a customer’s name and profile information will most likely be accessed together but not their purchase history. So, customer name and profile information data can be grouped into a column family.
  4. Cells corresponding to a column are saved as a continuous disk entry, making access and search easier and faster.
  5. Great for systems that require heavy write requests, storing time-series data, weather data, and IoT Data. NOT A GREAT FIT FOR :
  6. Run complex queries
  7. change patterns frequently.

Graph-Based : ex(Neo4J, CosmosDB)

  1. Graph-based DBs use graphical models to represent and store data.
  2. Useful for visualizing, analyzing, and finding connections between different pieces of data
  3. Great for Social Networks, Product recommendations, network diagrams, Fraud Detection, and Access management. NOT A GOOD FIT FOR:
  4. Process high volumes of transactions.

Advantages of NoSQL:

  1. Ability to handle large volumes of structures, semi-str, and unstructured data.
  2. Ability to run as a distributed system scaled across multiple data centers.
  3. Effective, cost-effective scale-out architecture that provides additional capacity and performance with the addition of new nodes.
  4. Simple design, better control over the availability, and improved scalability that makes it agile, flexible, and supports quick iterations.

Key Differences

Relational Databases Non-Relational Databases RDBMS Schemas rigidly define how all data inserted into the database must be typed and composed NoSQL DBs can be schema-agnostic allowing unstructured and semi-structured data to be stored and manipulated Maintaining high-end, commercial relational database management systems can be expensive Specifically designed for low-cost commodity hardware Supports ACID - Compliance which ensures reliability of transactions and crashes recovery Most NoSQL databases are not ACID Compliant A mature and well-documented technology, which means the risks are more or less perceivable. A relatively newer technology.