Data Ecosystem, Repositories, Pipelines and Integration - Part 1

The data ecosystem exists for:

  1. Extracting data from disparate sources
  2. Architecting and managing data pipelines for transformation, integration, and storage
  3. Architecting and managing data repositories
  4. Automating and optimizing workflows and the flow of data between systems
  5. Developing applications needed throughout the data engineering workflow

Data sources:

  1. Relational Databases
  2. Non-relational Databases
  3. APIs
  4. Web Services
  5. Data Streams
  6. Social Platforms
  7. Sensor Devices

Data Repositories

  1. OLTP (Online Transaction Processing): Designed to store high-volume, day-to-day operational data. Typically relational, but can also be non-relational.
  2. OLAP (Online Analytical Processing): Optimized for conducting complex data analytics. Includes relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores.

Data Integration

Collated -> Processed -> Cleansed -> Integrated -> Users. Data integration combines data from disparate sources into a unified view that users can access to query and manipulate the data.
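
As an informal sketch of that flow, the snippet below (with made-up CRM and billing extracts) cleanses two sources and merges them into a single unified view using pandas:

```python
import pandas as pd

# Hypothetical extracts from two disparate sources: a CRM export and a billing export.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": [" ana ", "BEN", "Chau"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "total_spend": [120.0, 75.5, 30.0]})

# Cleanse: normalize the name field before integrating.
crm["name"] = crm["name"].str.strip().str.title()

# Integrate: combine both sources into a single unified view keyed on customer_id.
unified = crm.merge(billing, on="customer_id", how="outer")
print(unified)  # the unified view users would query
```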

Data Pipelines

A set of tools and processes that covers the entire journey of data from source to destination systems.
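
A minimal sketch of such a pipeline, assuming hypothetical source and destination file names, with one function per stage:

```python
# A minimal extract-transform-load sketch; the file names and cleaning rules
# are hypothetical stand-ins for real source and destination systems.
import csv
import json

def extract(path):
    """Read raw rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Cleanse and reshape the rows: drop incomplete records, normalize casing."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("name") and r.get("amount")
    ]

def load(rows, path):
    """Write the transformed rows to the destination (here, a JSON file)."""
    with open(path, "w") as f:
        json.dump(rows, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("sales_raw.csv")), "sales_clean.json")
```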

Languages

  1. Query languages, e.g. SQL
  2. Programming languages, e.g. Python
  3. Shell and scripting languages, for repetitive operational tasks
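
These are often used together; the sketch below drives a query language (SQL) from a programming language (Python) via the standard-library sqlite3 module (table and rows are hypothetical):

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "EU", 75.5), (3, "US", 30.0)],
)

# SQL does the aggregation; Python consumes the result.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
):
    print(region, total)
```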

Business Intelligence(BI) and Reporting Tools:

Collect data and present it in a visual format, such as interactive dashboards. Visualizations can run in real time or on a pre-defined schedule. Typically drag-and-drop tools.

Types of data

  1. Structured: SQL databases, OLTP systems, spreadsheets, online forms, sensors such as GPS and RFID, network and web server logs
  2. Semi-structured: e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, data integrations (XML and JSON allow users to define their own tags and attributes; see the parsing sketch after this list)
  3. Unstructured: web pages, social media feeds, images in varied file formats, video and audio files, documents and PDF files, PowerPoint presentations, media logs, and surveys (can be stored in files and documents for manual analysis, or in NoSQL databases that provide their own analysis tools)
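
A minimal sketch of handling semi-structured data: parsing hypothetical JSON records with user-defined, nested fields into a structured table with pandas:

```python
import json

import pandas as pd

# Hypothetical semi-structured records: nested fields, some keys missing.
raw = """
[
  {"id": 1, "user": {"name": "Ana", "city": "Lima"}, "tags": ["new", "mobile"]},
  {"id": 2, "user": {"name": "Ben"}, "tags": []}
]
"""

records = json.loads(raw)

# Flatten the nested structure into rows and columns; missing fields become NaN.
table = pd.json_normalize(records)
print(table[["id", "user.name", "user.city"]])
```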

File Formats:

  1. Delimited file formats, or .CSV (flat files)
  2. Microsoft Excel Open XML Spreadsheet, or .XLSX
  3. Extensible Markup Language, or .XML
  4. Portable Document Format, or .PDF
  5. JavaScript Object Notation, or .JSON (a short reading sketch follows this list)
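
A hedged sketch of loading several of these formats with pandas; the file names are hypothetical, and Excel/XML reading assumes the optional openpyxl and lxml dependencies are installed (PDF usually needs a dedicated library and is not shown):

```python
import pandas as pd

# Hypothetical files; each reader returns a DataFrame.
csv_df = pd.read_csv("customers.csv")      # delimited / flat file
xlsx_df = pd.read_excel("customers.xlsx")  # Excel Open XML Spreadsheet (needs openpyxl)
json_df = pd.read_json("customers.json")   # JSON
xml_df = pd.read_xml("customers.xml")      # XML (pandas >= 1.3, needs lxml)

for name, df in [("csv", csv_df), ("xlsx", xlsx_df), ("json", json_df), ("xml", xml_df)]:
    print(name, df.shape)
```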

APIs and Web Services

APIs and web services take input from web or network requests and return data in different file formats. API examples (a request sketch follows the list):

  1. Twitter and Facebook APIs: customer sentiment analysis
  2. Stock market APIs: trading analysis
  3. Data lookup and validation APIs: cleaning and correlating data
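
A minimal request sketch using the requests library; the endpoint and parameters are hypothetical placeholders, not a real service:

```python
import requests

# Hypothetical stock-quote endpoint; most data APIs return JSON.
response = requests.get(
    "https://api.example.com/v1/quotes",
    params={"symbol": "IBM", "interval": "1d"},
    timeout=10,
)
response.raise_for_status()

quotes = response.json()
print(quotes)
```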

Web Scraping:

  1. Extracts relevant data from unstructured sources
  2. Also known as screen scraping, web harvesting, and web data extraction
  3. Downloads specific data based on defined parameters
  4. Extracts text, contact information, images, videos, product items, etc.
  5. Uses: generating sales leads through public data sources, gathering data from posts and authors on forums and communities
  6. Tools used: Beautiful Soup, Scrapy, Pandas, Selenium (a scraping sketch follows this list)
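
A minimal scraping sketch with requests and Beautiful Soup; the URL and CSS selector are hypothetical and would need to match the real page structure:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a hypothetical product listing page.
page = requests.get("https://example.com/products", timeout=10)
page.raise_for_status()

soup = BeautifulSoup(page.text, "html.parser")

# Extract product names and links based on an assumed page structure.
for item in soup.select("a.product-name"):
    print(item.get_text(strip=True), item.get("href"))
```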

Data Streams and Feeds:

  1. Aggregating streams of data flowing from instruments, IoT devices and applications, GPS data from cars, computer programs, websites, and social media posts
  2. Uses: stock and market tickers for financial trading, retail transaction streams for predicting demand and supply chain management
  3. Tools used: Apache Kafka, Apache Spark, and Apache Storm (a consumer sketch follows this list)
  4. RSS (Really Simple Syndication) feeds: capturing updated data from online forums and news sites where data is refreshed on an ongoing basis
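
A hedged consumer sketch using the kafka-python client; the topic name and broker address are hypothetical, and it assumes a reachable Kafka broker:

```python
import json

from kafka import KafkaConsumer

# Subscribe to a hypothetical retail transaction stream.
consumer = KafkaConsumer(
    "retail-transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Each message is one event flowing through the stream.
for message in consumer:
    event = message.value
    print(event.get("store_id"), event.get("amount"))
```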

Unix/Linux Shell:

A shell script is a computer program written for the UNIX shell: a series of UNIX commands in a plain text file that accomplishes a specific task. Used for (a small backup sketch follows this list):

  1. File manipulation
  2. Program execution
  3. System administration tasks such as disk backups and evaluating system logs
  4. Installation scripts for complex programs
  5. Executing routine backups
  6. Running batches
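
The kind of routine backup a shell script would automate can be sketched as follows; this is written in Python for consistency with the other examples here, and the directories are hypothetical:

```python
import shutil
import time
from pathlib import Path

source = Path("/var/logs/app")   # hypothetical directory to back up
backup_root = Path("/backups")   # hypothetical backup location
stamp = time.strftime("%Y%m%d-%H%M%S")

# Copy the directory tree into a timestamped zip archive,
# e.g. /backups/app-20240101-020000.zip
archive = shutil.make_archive(str(backup_root / f"app-{stamp}"), "zip", root_dir=source)
print(f"wrote {archive}")
```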

PowerShell:

  1. A cross-platform automation tool and configuration framework optimized for working with structured data formats such as JSON and XML, REST APIs, websites, and office applications
  2. Consists of a command-line shell and a scripting language
  3. Is object-based and can be used to filter, sort, measure, group, and compare objects as they pass through a data pipeline
  4. Used for data mining, building GUIs, and creating charts, dashboards, and interactive reports

A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it, through the extract, transform, and load (ETL) process, into one comprehensive database for analytics and business intelligence.

Popular metadata management tools include:

  • IBM InfoSphere Information Server
  • CA Erwin Data Modeler
  • Oracle Warehouse Builder
  • SAS Data Integration Server
  • Talend Data Fabric
  • Alation Data Catalog
  • SAP Information Steward
  • Microsoft Azure Data Catalog
  • IBM Watson Knowledge Catalog
  • Oracle Enterprise Metadata Management (OEMM)
  • Adaptive Metadata Manager
  • Unifi Data Catalog
  • data.world
  • Informatica Enterprise Data Catalog

SUMMARY

  • Metadata is data that provides information about other data and includes three main types: technical, process, and business metadata
  • The technical metadata for relational databases is typically stored in specialized tables in the database called the system catalog
  • A primary objective of business metadata management is the creation and maintenance of a reliable, user-friendly data catalog
  • Having access to a well-implemented data catalog greatly enhances data discovery, repeatability, and governance, and can also facilitate access to data
  • Metadata management tools from IBM include InfoSphere Information Server and Watson Knowledge Catalog

Note (Non-relational databases): A non-relational database is a database that does not use the tabular schema of rows and columns found in most traditional database systems. Instead, non-relational databases use a storage model that is optimized for the specific requirements of the type of data being stored. For example, data may be stored as simple key/value pairs, as JSON documents, or as a graph consisting of edges and vertices.
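
As a small illustration of those storage models, the same hypothetical customer record is shown below as a key/value pair and as a JSON document:

```python
import json

# Key/value model: the whole record is an opaque value behind a single key.
kv_store = {
    "customer:42": "Ana Silva|Lima|premium",
}

# Document model: the record is a self-describing JSON document whose fields
# can be read individually.
document_store = {
    "customer:42": json.dumps(
        {"name": "Ana Silva", "city": "Lima", "tier": "premium"}
    ),
}

print(kv_store["customer:42"])
print(json.loads(document_store["customer:42"])["city"])
```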

The term NoSQL refers to data stores that do not use SQL for queries. Instead, the data stores use other programming languages and constructs to query the data. In practice, "NoSQL" means "non-relational database," even though many of these databases do support SQL-compatible queries. However, the underlying query execution strategy is usually very different from the way a traditional RDBMS would execute the same SQL query.