Data Engineering

I am starting a series of posts on Data Engineering, based on the "Data Engineering Professional Certificate" specialization offered by IBM, which I completed on Coursera.

This series will help anyone looking to get started in Data Engineering, or simply hunting for information, to get an overview of what they would encounter on their journey to becoming a DE.

Data Engineering - The Start

The modern data ecosystem is a network of interconnected and continually evolving entities, including:

  1. Data, which is available in a host of different formats, structures, and sources.
  2. The Enterprise Data Environment, in which raw data is staged so it can be organized, cleaned, and optimized for use by end-users.
  3. End-users, such as business stakeholders, analysts, and programmers, who consume data for various purposes.

Emerging technologies such as Cloud Computing, Machine Learning, and Big Data are continually reshaping the data ecosystem and the possibilities it offers.

Data Engineers, Data Analysts, Data Scientists, Business Analysts, and Business Intelligence Analysts all play a vital role in the ecosystem, deriving insights and business results from data.

The goal of Data Engineering is to make quality data available for analytics and decision-making. It does this by collecting raw source data, processing it so it becomes usable, storing it, and making quality data available to users securely.
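
To make that concrete, here is a minimal sketch of the collect-process-store pattern using only Python's standard library; the file orders.csv and its columns are hypothetical placeholders, not part of the course material.

```python
import csv
import sqlite3

# Collect: read raw source data (hypothetical orders.csv with id, amount columns).
with open("orders.csv", newline="") as f:
    raw_rows = list(csv.DictReader(f))

# Process: clean the raw records so they become usable
# (drop rows with missing amounts, convert types).
clean_rows = [
    (int(row["id"]), float(row["amount"]))
    for row in raw_rows
    if row.get("amount")
]

# Store: load the cleaned data into a database for end-users.
conn = sqlite3.connect("warehouse.db")
conn.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", clean_rows)
conn.commit()

# Make available: analysts can now query the curated table.
print(conn.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())
conn.close()
```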

Technical Skills Required:

  1. Operating Systems: UNIX, Linux, Windows administrative tools, system utilities, and commands.
  2. Infrastructure Components: Virtual Machines, Networking, Application Services, Cloud-based Services.
  3. Databases and Data Warehouses:
     a. RDBMS: IBM Db2, MySQL, Oracle, PostgreSQL
     b. NoSQL: Redis, MongoDB, Cassandra, Neo4j
     c. Data Warehouses: Oracle Exadata, IBM Db2 Warehouse on Cloud, IBM Netezza Performance Server, Amazon Redshift
  4. Data Pipelines: Apache Beam, Apache Airflow, Dataflow (see the Airflow sketch after this list).
  5. ETL Tools: IBM InfoSphere, AWS Glue, Improvado.
  6. Languages:
     a. Query Languages: SQL for relational databases and SQL-like languages for NoSQL databases
     b. Programming Languages: Python, R, Java
     c. Shell and Scripting Languages: Unix/Linux shell and PowerShell
  7. Big Data Processing Tools: Hadoop, Hive, Apache Spark (see the Spark sketch after this list).
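
As a small illustration of item 4, here is a minimal Apache Airflow DAG sketch; the DAG id, task names, and the extract/transform/load callables are hypothetical placeholders.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# Hypothetical placeholder callables for each pipeline stage.
def extract():
    print("pull raw data from the source system")

def transform():
    print("clean and reshape the raw data")

def load():
    print("write curated data to the warehouse")

# A DAG wires the stages into a scheduled, dependency-aware pipeline.
with DAG(
    dag_id="simple_etl",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_transform = PythonOperator(task_id="transform", python_callable=transform)
    t_load = PythonOperator(task_id="load", python_callable=load)

    # Run extract, then transform, then load.
    t_extract >> t_transform >> t_load
```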
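
And for item 7, a minimal PySpark sketch of the same idea at scale; again, the file orders.csv and its columns are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (cluster configuration omitted for brevity).
spark = SparkSession.builder.appName("orders_summary").getOrCreate()

# Read the raw file, letting Spark infer column types.
df = spark.read.csv("orders.csv", header=True, inferSchema=True)

# Aggregate: total order amount per customer (hypothetical columns).
summary = df.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))
summary.show()

spark.stop()
```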

Functional Skills:

  1. Convert business requirements into technical specifications.
  2. Work with the complete software development lifecycle:

Ideation -> Architecture -> Design -> Prototyping -> Testing -> Deployment -> Monitoring.

  3. Understand the potential applications of data in business.
  4. Understand the risks of poor data management: Data Quality | Data Privacy | Security | Compliance.