Big Data Platforms

The V's of Big Data:

  1. Velocity: Velocity is the speed at which data accumulates. Data is being generated extremely fast, in a process that never stops. Near or real-time streaming, local, and cloud-based technologies can process information very quickly.
  2. Volume: Volume is the scale of the data or the increase in the amount of data stored. Drivers of volume are the increase in data sources, higher resolution sensors, and scalable infrastructure.
  3. Variety: Variety is the diversity of the data. Structured data fits neatly into the rows and columns of relational databases, while unstructured data is not organized in a pre-defined way, like tweets, blog posts, pictures, numbers, and videos. Variety also reflects that data comes from different sources, machines, people, and processes, both internal and external to organizations.
  4. Value: Value is our ability and need to turn data into value. Value is not just profit; it can also mean medical or social benefits, or customer, employee, and personal satisfaction.
  5. Veracity: Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity. Drivers include mobile technologies, social media, wearable technologies, geo technologies, video, and many more.

Data Processing Tools:

We have big data tools to help us work with different types of data (structured, semi-structured, and unstructured).

Open-source big data technologies and the roles they play:

  1. Apache Spark
  2. Apache Hive
  3. Apache Hadoop

Hadoop:

A collection of tools that provides distributed storage and processing of big data

  1. distributed storage and processing of large datasets across clusters of computers
  2. A node is a single computer and a collection of nodes form a cluster.
  3. Hadoop can scale up from a single node to multiple nodes, each offering local storage and computation.
  4. Hadoop provides a reliable, scalable, and cost-effective solution for storing data with no format requirements.
  5. Better real-time data-driven decisions
  6. Improved data access and analysis: provides real-time, self-service access to stakeholders.
  7. Data offload and consolidation: optimizes and streamlines costs by consolidating data, including cold data, across the organization.

The Hadoop Distributed File System (HDFS) is a storage system for big data that runs on multiple commodity hardware nodes connected through a network.

  1. Provides scalable and reliable big data storage by partitioning files over multiple nodes.
  2. Splits large files across multiple computers, allowing parallel access to them.
  3. Replicates file blocks on different nodes to prevent data loss.
  4. Fast recovery from hardware failures: HDFS is built to detect faults and automatically recover.
  5. Access to streaming data: HDFS supports high data throughput rates.
  6. Accommodation of large datasets: HDFS can scale to hundreds of nodes or computers in a single cluster.
  7. Portability: HDFS is portable across multiple hardware platforms and compatible with a variety of underlying operating systems.

Example: Consider a file that includes phone numbers for everyone in the United States; the numbers for people with last names starting with A might be stored on server 1, B on server 2, and so on. With Hadoop, pieces of this phonebook would be stored across the cluster. To reconstruct the entire phonebook, your program would need the blocks from every server in the cluster. HDFS also replicates these smaller pieces onto two additional servers by default, ensuring availability when a server fails. In addition to higher availability, this offers multiple benefits: it allows the Hadoop cluster to break up work into smaller chunks and run those jobs on all servers in the cluster for better scalability.
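The partitioning and replication described above can be sketched in a few lines of plain Python. This is an illustrative toy model, not real HDFS: the block size, node names, and placement policy are all made up for the example (real HDFS defaults to 128 MB blocks and a replication factor of 3, with rack-aware placement).

```python
BLOCK_SIZE = 8          # bytes per block (toy value; HDFS uses ~128 MB)
REPLICATION = 3         # copies of each block (HDFS default)
NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster

def split_into_blocks(data: bytes, block_size: int) -> list[bytes]:
    """Partition the raw file data into fixed-size blocks."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(num_blocks: int, nodes: list[str],
                   replication: int) -> dict[int, list[str]]:
    """Assign each block to `replication` distinct nodes, round-robin."""
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }

data = b"phonebook: Adams, Baker, Clark, Davis"
blocks = split_into_blocks(data, BLOCK_SIZE)
placement = place_replicas(len(blocks), NODES, REPLICATION)

# Losing any single node still leaves at least one copy of every block,
# which is the availability property the phonebook example describes.
for dead in NODES:
    assert all(any(n != dead for n in placement[b]) for b in placement)
```

Joining the blocks back together reconstructs the original file, just as a client reading from HDFS reassembles a file from its blocks.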

Hive:

Data warehouse for data query and analysis built on top of Hadoop

  1. Open-source data warehouse software for reading, writing, and managing large dataset files that are stored directly in either HDFS or other data storage systems such as Apache HBase.
  2. Since Hive is built on Hadoop and Hadoop is built for long sequential scans, Hive is not suitable for applications that need fast response times.
  3. Hive is read-based and not suitable for transaction processing that involves a high percentage of write operations.
  4. Hive is better suited for data warehousing tasks such as ETL, reporting, and analysis.
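Hive's query language, HiveQL, is SQL-like, so the kind of read-heavy reporting query it is suited for can be sketched with Python's built-in sqlite3 module. The table, columns, and sales figures below are invented for illustration; a real Hive query would run over files in HDFS rather than an in-memory database, but the aggregation pattern is the same.

```python
import sqlite3

# In-memory stand-in for a warehouse table of (hypothetical) sales data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 75.0)],
)

# A typical warehouse/reporting aggregation: total sales per region.
rows = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
print(rows)  # [('east', 350.0), ('west', 75.0)]
```

Note that this query scans the whole table: that batch, scan-oriented access pattern is why Hive tolerates long sequential reads but is a poor fit for fast, transactional workloads.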

Spark:

Distributed analytics framework for complex, real-time data analytics. A general-purpose data processing engine designed to extract and process large volumes of data from a wide range of applications, including:

  1. Interactive Analytics
  2. Streams Processing
  3. Machine Learning

Attributes:

  1. Has in-memory processing which significantly increases the speed of computations
  2. Provides interfaces for major programming languages such as Java, Scala, Python, R, and SQL.
  3. Can run using its standalone clustering technology
  4. Can also run on top of other infrastructures, such as Hadoop.
  5. Can access data in a large variety of data sources, including HDFS and Hive
  6. Processes streaming data fast
  7. Performs complex analysis in real-time.
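Spark programs are built by applying transformations such as map and reduce to data spread over partitions. As a rough sketch of that programming model, assuming no actual Spark cluster, the snippet below counts words across two "partitions" that are really just Python lists; the per-partition counting mirrors Spark's map stage and the merge mirrors a reduceByKey-style step.

```python
from collections import Counter
from functools import reduce

# Two toy "partitions" of a text dataset (invented example data).
partitions = [
    ["big data platforms", "big data tools"],
    ["spark processes streaming data fast"],
]

# "Map" stage: count words independently within each partition,
# as each Spark executor would on its slice of the data.
partial_counts = [
    Counter(word for line in part for word in line.split())
    for part in partitions
]

# "Reduce" stage: merge the per-partition counts into final totals.
totals = reduce(lambda a, b: a + b, partial_counts)
print(totals["data"])  # 3
```

Because every intermediate result here lives in memory, the sketch also hints at why Spark's in-memory processing is fast: no stage writes its output to disk before the next stage reads it.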

SUMMARY

Big Data refers to the vast amounts of data being produced every moment of every day by people, tools, and machines. The sheer velocity, volume, and variety of this data challenged the tools and systems used for conventional data, leading to the emergence of processing tools and platforms designed specifically for Big Data.

Big Data processing technologies help derive value from big data. These include NoSQL databases, data lakes, and open-source technologies such as Apache Hadoop, Apache Hive, and Apache Spark.

  • Hadoop provides distributed storage and processing of large datasets across clusters of computers. One of its main components, the Hadoop Distributed File System, or HDFS, is a storage system for big data.
  • Hive is a data warehouse software for reading, writing, and managing large datasets.
  • Spark is a general-purpose data processing engine designed to extract and process large volumes of data.