Data Ecosystem, Repositories, Pipelines and Integration - Part 1

The data ecosystem exists for:

  1. Extracting data from disparate sources
  2. Architecting and managing data pipelines for transformation, integration, and storage
  3. Architecting and managing data repositories
  4. Automating and optimizing workflows and the flow of data between systems
  5. Developing applications needed throughout the data engineering workflow

Data sources:

  1. Relational Databases
  2. Non-relational Databases
  3. APIs
  4. Web Services
  5. Data Streams
  6. Social Platforms
  7. Sensor Devices

Data Repositories

  1. OLTP (Online Transaction Processing): Designed to store high-volume, day-to-day operational data. Typically relational, but can also be non-relational.
  2. OLAP (Online Analytical Processing): Optimized for conducting complex data analytics. Includes relational and non-relational databases, data warehouses, data marts, data lakes, and big data stores.

Data Integration

Collated -> Processed -> Cleansed -> Integrated -> Users. Data integration combines data from disparate sources into a unified view that users can access to query and manipulate the data.
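
As an informal sketch of that flow, the snippet below (with made-up CRM and billing extracts) cleanses two sources and merges them into a single unified view using pandas:

```python
import pandas as pd

# Hypothetical extracts from two disparate sources: a CRM export and a billing export.
crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": [" ana ", "BEN", "Chau"]})
billing = pd.DataFrame({"customer_id": [1, 2, 4], "total_spend": [120.0, 75.5, 30.0]})

# Cleanse: normalize the name field before integrating.
crm["name"] = crm["name"].str.strip().str.title()

# Integrate: combine both sources into a single unified view keyed on customer_id.
unified = crm.merge(billing, on="customer_id", how="outer")
print(unified)  # the unified view users would query
```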

Data Pipelines

A set of tools and processes that covers the entire journey of data from source to destination systems.
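
A minimal sketch of such a pipeline, assuming hypothetical source and destination file names, with one function per stage:

```python
# A minimal extract-transform-load sketch; the file names and cleaning rules
# are hypothetical stand-ins for real source and destination systems.
import csv
import json

def extract(path):
    """Read raw rows from a CSV source file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    """Cleanse and reshape the rows: drop incomplete records, normalize casing."""
    return [
        {"name": r["name"].strip().title(), "amount": float(r["amount"])}
        for r in rows
        if r.get("name") and r.get("amount")
    ]

def load(rows, path):
    """Write the transformed rows to the destination (here, a JSON file)."""
    with open(path, "w") as f:
        json.dump(rows, f, indent=2)

if __name__ == "__main__":
    load(transform(extract("sales_raw.csv")), "sales_clean.json")
```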

Languages

  1. Query languages, e.g. SQL
  2. Programming languages, e.g. Python
  3. Shell and scripting languages, for repetitive operational tasks
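
These are often used together; the sketch below drives a query language (SQL) from a programming language (Python) via the standard-library sqlite3 module (table and rows are hypothetical):

```python
import sqlite3

# In-memory database with a hypothetical orders table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "EU", 120.0), (2, "EU", 75.5), (3, "US", 30.0)],
)

# SQL does the aggregation; Python consumes the result.
for region, total in conn.execute(
    "SELECT region, SUM(amount) FROM orders GROUP BY region"
):
    print(region, total)
```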

Business Intelligence(BI) and Reporting Tools:

Collect data and present it in a visual format, such as interactive dashboards. Visualizations can run in real time or on a pre-defined schedule. Typically drag-and-drop tools.

Types of data

  1. Structured: SQL databases, OLTP systems, spreadsheets, online forms, sensors such as GPS and RFID, network and web server logs
  2. Semi-structured: e-mails, XML and other markup languages, binary executables, TCP/IP packets, zipped files, data integrations (XML and JSON allow users to define their own tags and attributes; see the parsing sketch after this list)
  3. Unstructured: web pages, social media feeds, images in varied file formats, video and audio files, documents and PDF files, PowerPoint presentations, media logs, and surveys (can be stored in files and documents for manual analysis, or in NoSQL databases that provide their own analysis tools)
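
A minimal sketch of handling semi-structured data: parsing hypothetical JSON records with user-defined, nested fields into a structured table with pandas:

```python
import json

import pandas as pd

# Hypothetical semi-structured records: nested fields, some keys missing.
raw = """
[
  {"id": 1, "user": {"name": "Ana", "city": "Lima"}, "tags": ["new", "mobile"]},
  {"id": 2, "user": {"name": "Ben"}, "tags": []}
]
"""

records = json.loads(raw)

# Flatten the nested structure into rows and columns; missing fields become NaN.
table = pd.json_normalize(records)
print(table[["id", "user.name", "user.city"]])
```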

File Formats:

  1. Delimited file formats, or .CSV (flat files)
  2. Microsoft Excel Open XML Spreadsheet, or .XLSX
  3. Extensible Markup Language, or .XML
  4. Portable Document Format, or .PDF
  5. JavaScript Object Notation, or .JSON (a short reading sketch follows this list)
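
A hedged sketch of loading several of these formats with pandas; the file names are hypothetical, and Excel/XML reading assumes the optional openpyxl and lxml dependencies are installed (PDF usually needs a dedicated library and is not shown):

```python
import pandas as pd

# Hypothetical files; each reader returns a DataFrame.
csv_df = pd.read_csv("customers.csv")      # delimited / flat file
xlsx_df = pd.read_excel("customers.xlsx")  # Excel Open XML Spreadsheet (needs openpyxl)
json_df = pd.read_json("customers.json")   # JSON
xml_df = pd.read_xml("customers.xml")      # XML (pandas >= 1.3, needs lxml)

for name, df in [("csv", csv_df), ("xlsx", xlsx_df), ("json", json_df), ("xml", xml_df)]:
    print(name, df.shape)
```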

APIs and Web Services

APIs and web services take input from web or network requests and return data in different file formats. API examples (a request sketch follows the list):

  1. Twitter and Facebook APIs: customer sentiment analysis
  2. Stock market APIs: trading analysis
  3. Data lookup and validation APIs: cleaning and correlating data
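
A minimal request sketch using the requests library; the endpoint and parameters are hypothetical placeholders, not a real service:

```python
import requests

# Hypothetical stock-quote endpoint; most data APIs return JSON.
response = requests.get(
    "https://api.example.com/v1/quotes",
    params={"symbol": "IBM", "interval": "1d"},
    timeout=10,
)
response.raise_for_status()

quotes = response.json()
print(quotes)
```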

Web Scraping:

  1. Extracts relevant data from unstructured sources
  2. Also known as screen scraping, web harvesting, and web data extraction
  3. Downloads specific data based on defined parameters
  4. Extracts text, contact information, images, videos, product items, etc.
  5. Uses: generating sales leads through public data sources, gathering data from posts and authors on forums and communities
  6. Tools used: Beautiful Soup, Scrapy, Pandas, Selenium (a scraping sketch follows this list)
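
A minimal scraping sketch with requests and Beautiful Soup; the URL and CSS selector are hypothetical and would need to match the real page structure:

```python
import requests
from bs4 import BeautifulSoup

# Fetch a hypothetical product listing page.
page = requests.get("https://example.com/products", timeout=10)
page.raise_for_status()

soup = BeautifulSoup(page.text, "html.parser")

# Extract product names and links based on an assumed page structure.
for item in soup.select("a.product-name"):
    print(item.get_text(strip=True), item.get("href"))
```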

Data Streams and Feeds:

  1. Aggregating streams of data flowing from instruments, IoT devices and applications, GPS data from cars, computer programs, websites, and social media posts
  2. Uses: stock and market tickers for financial trading, retail transaction streams for predicting demand and supply chain management
  3. Tools used: Apache Kafka, Apache Spark, and Apache Storm (a consumer sketch follows this list)
  4. RSS (Really Simple Syndication) feeds: capturing updated data from online forums and news sites where data is refreshed on an ongoing basis
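
A hedged consumer sketch using the kafka-python client; the topic name and broker address are hypothetical, and it assumes a reachable Kafka broker:

```python
import json

from kafka import KafkaConsumer

# Subscribe to a hypothetical retail transaction stream.
consumer = KafkaConsumer(
    "retail-transactions",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

# Each message is one event flowing through the stream.
for message in consumer:
    event = message.value
    print(event.get("store_id"), event.get("amount"))
```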

Unix/Linux Shell:

A shell script is a computer program written for the UNIX shell: a series of UNIX commands in a plain text file that accomplishes a specific task. Used for (a small backup sketch follows this list):

  1. File manipulation
  2. Program execution
  3. System administration tasks such as disk backups and evaluating system logs
  4. Installation scripts for complex programs
  5. Executing routine backups
  6. Running batches
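
The kind of routine backup a shell script would automate can be sketched as follows; this is written in Python for consistency with the other examples here, and the directories are hypothetical:

```python
import shutil
import time
from pathlib import Path

source = Path("/var/logs/app")   # hypothetical directory to back up
backup_root = Path("/backups")   # hypothetical backup location
stamp = time.strftime("%Y%m%d-%H%M%S")

# Copy the directory tree into a timestamped zip archive,
# e.g. /backups/app-20240101-020000.zip
archive = shutil.make_archive(str(backup_root / f"app-{stamp}"), "zip", root_dir=source)
print(f"wrote {archive}")
```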

PowerShell:

  1. A cross-platform automation tool and configuration framework optimized for working with structured data formats such as JSON and XML, REST APIs, websites, and office applications
  2. Consists of a command-line shell and a scripting language
  3. Is object-based and can be used to filter, sort, measure, group, and compare objects as they pass through a data pipeline
  4. Used for data mining, building GUIs, and creating charts, dashboards, and interactive reports

A data warehouse works as a central repository that merges information coming from disparate sources and consolidates it, through the extract, transform, and load (ETL) process, into one comprehensive database for analytics and business intelligence.

Popular metadata management tools include:

  • IBM InfoSphere Information Server
  • CA Erwin Data Modeler
  • Oracle Warehouse Builder
  • SAS Data Integration Server
  • Talend Data Fabric
  • Alation Data Catalog
  • SAP Information Steward
  • Microsoft Azure Data Catalog
  • IBM Watson Knowledge Catalog
  • Oracle Enterprise Metadata Management (OEMM)
  • Adaptive Metadata Manager
  • Unifi Data Catalog
  • data.world
  • Informatica Enterprise Data Catalog

SUMMARY

  • Metadata is data that provides information about other data and includes three main types: technical, process, and business metadata
  • The technical metadata for relational databases is typically stored in specialized tables in the database called the system catalog
  • A primary objective of business metadata management is the creation and maintenance of a reliable, user-friendly data catalog
  • Having access to a well-implemented data catalog greatly enhances data discovery, repeatability, and governance, and can also facilitate access to data
  • Metadata management tools from IBM include InfoSphere Information Server and Watson Knowledge Catalog

Note (Non-relational databases): A non-relational database is a database that does not use the tabular schema of rows and columns found in most traditional database systems. Instead, non-relational databases use a storage model that is optimized for the specific requirements of the type of data being stored. For example, data may be stored as simple key/value pairs, as JSON documents, or as a graph consisting of edges and vertices.
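
As a small illustration of those storage models, the same hypothetical customer record is shown below as a key/value pair and as a JSON document:

```python
import json

# Key/value model: the whole record is an opaque value behind a single key.
kv_store = {
    "customer:42": "Ana Silva|Lima|premium",
}

# Document model: the record is a self-describing JSON document whose fields
# can be read individually.
document_store = {
    "customer:42": json.dumps(
        {"name": "Ana Silva", "city": "Lima", "tier": "premium"}
    ),
}

print(kv_store["customer:42"])
print(json.loads(document_store["customer:42"])["city"])
```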

The term NoSQL refers to data stores that do not use SQL for queries. Instead, the data stores use other programming languages and constructs to query the data. In practice, "NoSQL" means "non-relational database," even though many of these databases do support SQL-compatible queries. However, the underlying query execution strategy is usually very different from the way a traditional RDBMS would execute the same SQL query.