Unit 5 | Big Data Notes | AKTU Notes

    PIG

    Pig is a high-level platform developed by Apache for analyzing large data sets on Hadoop. It
    uses a dataflow language called Pig Latin, which feels familiar to SQL users but is designed
    for step-by-step processing of large-scale data; Pig compiles these scripts into MapReduce jobs.

    Types of Pig Execution Modes

    1. Local Mode

    1. In this mode, Pig runs on a single local machine.
    2. It uses the local file system instead of HDFS.
    3. It is mainly used for development and testing purposes with smaller datasets.
    4. There is no need for Hadoop setup in local mode.

    2. MapReduce Mode (Hadoop Mode)

    1. This is the production mode where Pig scripts are converted into MapReduce jobs and executed over a Hadoop cluster.
    2. It supports large datasets that are stored in HDFS.
    3. Requires proper Hadoop setup and configuration.
    4. It provides scalability and fault tolerance.
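
    As a quick sketch (assuming a standard Pig installation; the script name is illustrative), the
    execution mode is selected with the -x flag when launching Pig:

        pig -x local                 # start Pig in local mode (local file system, no cluster needed)
        pig -x mapreduce             # start Pig in MapReduce mode (the default, runs over Hadoop/HDFS)
        pig -x local myscript.pig    # run a script file directly in local mode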

    Features of Pig

    Ease of Use: The Pig Latin language is simple and SQL-like, making it easier for developers
    and analysts to learn and use.
    Data Handling: It can work with both structured and semi-structured data (like 
    logs, JSON, XML).
    Extensibility: Users can write their own functions to handle special 
    requirements (called UDFs).
    Optimization: Pig automatically optimizes the execution of scripts, so users can 
    focus more on logic than performance tuning.
    Support for Large Datasets: It processes massive volumes of data efficiently by 
    converting scripts into multiple parallel tasks.
    Interoperability: It can work with other Hadoop tools like Hive, HDFS, and 
    HBase.

    Grunt Shell

    Grunt is the interactive shell or command-line interface of Pig. It allows users to write and execute Pig Latin commands line by line, similar to a SQL command line or terminal.
    • Useful for testing and debugging Pig Latin scripts.
    • Helps to run small tasks and check output instantly.
    • Automatically starts when you run Pig without any script file.
    • Can load data, process it, and display results interactively.
    • Example Use:
    Analysts use the Grunt shell to experiment with data, apply filters, and view outputs before finalizing their Pig script.
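
    For instance, a minimal Grunt session might look like the following (the file names, delimiter,
    and fields are illustrative):

        grunt> logs = LOAD 'web_logs.txt' USING PigStorage(',') AS (user:chararray, url:chararray, status:int);
        grunt> errors = FILTER logs BY status >= 400;
        grunt> DUMP errors;                      -- print the filtered tuples on the screen
        grunt> STORE errors INTO 'error_logs';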

    What are the various syntax and semantics of the Pig Latin programming language?

    Pig Latin is a high-level language used with Apache Pig for data processing. It 
    has specific rules (syntax and semantics) that define how the language should 
    be written and how it behaves.

    1. Statements:
    • A Pig Latin program is made up of multiple statements.
    • Each statement represents an operation or command and usually ends with a 
    semicolon.
    • Comments can be written using double hyphens (--) or C-style comments (/* 
    */).
    • Pig Latin has reserved keywords that cannot be used for naming variables or 
    aliases.
    • Operators and commands are not case-sensitive, but function names and 
    aliases are case-sensitive.
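
    A small sketch of these rules (file, alias, and field names are made up):

        -- load the input; keywords such as LOAD/load are not case-sensitive
        users = LOAD 'users.txt' AS (name:chararray, age:int);
        /* the alias "users" is case-sensitive:
           writing USERS below would refer to a different, undefined alias */
        adults = FILTER users BY age >= 18;   -- every statement ends with a semicolon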

    2. Expressions:
    • Expressions are parts of statements that produce a value.
    • They are used with relational operators in Pig.
    • Pig supports a variety of expressions, including mathematical and string operations.

    3. Types:
    Pig has several data types:
    Simple types: int, long, float, double, bytearray (binary), and chararray (text).
    Complex types:
    - Tuple: an ordered set of fields.
    - Bag: a collection of tuples.
    - Map: a collection of key-value pairs.

    4. Schemas:
    • Schemas define the structure (field names and data types) of a relation.
    • Unlike SQL, Pig allows partial or no schema at all; data types can be inferred later.
    • This makes Pig flexible for handling plain files with no predefined structure.
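
    As an illustration of the types and schema rules above (file and field names are assumptions),
    a schema can be declared fully, or left out entirely:

        -- full schema with simple and complex types (a bag of tuples and a map)
        students = LOAD 'students.txt'
                   AS (id:int, name:chararray, subjects:bag{t:(subject:chararray)}, info:map[]);

        -- no schema: fields are accessed by position and types are resolved later
        raw   = LOAD 'students.txt';
        names = FOREACH raw GENERATE $1;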

    5. Functions:
    • Pig has built-in functions of four types:
    - Eval functions – for computations.
    - Filter functions – to filter records.
    - Load functions – to load data.
    - Store functions – to save data.
    • If needed, users can create their own custom functions called User Defined 
    Functions (UDFs).
    • The Pig community also shares functions through a repository called Piggy Bank.

    6. Macros:
    • Macros are reusable code blocks within Pig Latin.
    • They make scripts cleaner and help avoid repetition.
    • Macros can be defined inside the script or in separate files and imported when 
    needed.
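
    A small sketch of a macro (relation, field, and threshold names are illustrative); a macro kept
    in a separate file would be brought in with an IMPORT statement:

        -- define a reusable macro that keeps rows above a threshold
        DEFINE keep_above(data_in, min_marks) RETURNS filtered {
            $filtered = FILTER $data_in BY marks > $min_marks;
        };

        marks_data = LOAD 'marks.txt' AS (name:chararray, marks:int);
        toppers    = keep_above(marks_data, 80);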

    User Defined Functions (UDFs) in Pig:

    • UDFs are custom functions created by the user when built-in functions in Pig 
    are not sufficient.
    • They are used to perform specific operations on data like filtering, 
    transformation, or formatting.
    • UDFs are typically written in Java, Python, or other supported languages and 
    can be used in Pig scripts like any other function.
    • Once written and registered in Pig, UDFs help make the script more powerful 
    and flexible.
    Example in simple words:
    If Pig does not have a function to extract only the year from a date field, the 
    user can create a UDF to do that and use it in their script.
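
    Continuing that example, here is a minimal sketch of such a UDF in Java (the class, jar,
    relation, and field names are assumptions):

        // ExtractYear.java
        import java.io.IOException;
        import org.apache.pig.EvalFunc;
        import org.apache.pig.data.Tuple;

        public class ExtractYear extends EvalFunc<String> {
            @Override
            public String exec(Tuple input) throws IOException {
                if (input == null || input.size() == 0 || input.get(0) == null) {
                    return null;
                }
                // expects a date string such as "2024-06-15" and returns "2024"
                String date = (String) input.get(0);
                return date.substring(0, 4);
            }
        }

    Once compiled into a jar, it is registered and called from Pig Latin like a built-in function:

        REGISTER extract-year.jar;
        years = FOREACH orders GENERATE ExtractYear(order_date);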

    Data Processing Operators in Pig:

    Pig provides several operators to process and transform data. Here are the most 
    common ones:
    1. LOAD – Loads data from the file system (like HDFS) into Pig for processing.
    2. DUMP – Displays the output of a relation on the screen.
    3. STORE – Saves the final result to a file or directory.
    4. FILTER – Removes unwanted rows based on a condition.
    5. FOREACH...GENERATE – Applies a transformation to each row (like selecting specific columns or applying functions).
    6. GROUP – Groups data by a specified field (used for aggregation).
    7. JOIN – Joins two or more datasets on a common field.
    8. ORDER BY – Sorts the data in ascending or descending order.
    9. DISTINCT – Removes duplicate records from the dataset.
    10. LIMIT – Restricts the number of output rows.
    11. UNION – Combines two datasets with the same structure.
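
    A short script combining several of these operators (file names, fields, and the delimiter are
    illustrative):

        sales   = LOAD 'sales.csv' USING PigStorage(',')
                  AS (region:chararray, product:chararray, amount:double);
        big     = FILTER sales BY amount > 1000.0;
        grouped = GROUP big BY region;
        totals  = FOREACH grouped GENERATE group AS region, SUM(big.amount) AS total;
        ordered = ORDER totals BY total DESC;
        top5    = LIMIT ordered 5;
        DUMP top5;
        STORE totals INTO 'region_totals';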

    Apache Hive Architecture and Installation

    Hive Architecture is designed to manage and query large datasets stored in 
    Hadoop’s HDFS using a SQL-like language called HiveQL. The key components 
    are:

    Metastore: Stores metadata (like table names, columns, data types, location) 
    in a relational database.
    Driver: Manages the lifecycle of a HiveQL statement (compilation to 
    execution).
    Compiler: Converts HiveQL queries into execution plans (usually MapReduce 
    jobs).
    Execution Engine: Runs the execution plan on Hadoop.
    User Interfaces: Includes Hive CLI, Beeline, Web UI, and HiveServer2.

    Hive Installation

    To install Hive:
    1. First, install and configure Hadoop.
    2. Download Hive from the Apache website.
    3. Extract and configure Hive by setting environment variables.
    4. Set up the Metastore (can use MySQL or Derby).
    5. Initialize the schema using Hive tools.
    6. Start Hive and begin executing queries.
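
    A rough sketch of these steps on Linux, assuming Hadoop is already running (the version number
    and paths are placeholders):

        tar -xzf apache-hive-3.1.3-bin.tar.gz
        export HIVE_HOME=$HOME/apache-hive-3.1.3-bin
        export PATH=$PATH:$HIVE_HOME/bin
        schematool -dbType derby -initSchema    # initialize the Metastore schema (use -dbType mysql for MySQL)
        hive                                    # start the Hive shell and begin running queries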

    Hive Shell

    The Hive Shell is a command-line tool where users can:
    • Run HiveQL queries
    • Create and manage tables
    • Load and query data
    • Check outputs and errors
    It is the most basic way to interact with Hive and is useful for testing and learning.

    Hive Services

    Hive includes several important services:
    HiveServer2: Allows clients to send queries remotely.
    Metastore Service: Handles all metadata operations.
    CLI/Beeline: Command-line interfaces to interact with Hive.
    Web Interface: GUI to manage and run queries (optional).

    Hive Metastore

    The Metastore stores metadata about databases, tables, partitions, and 
    columns. It helps the Hive engine understand the structure of the data. It can 
    be embedded (using Derby for testing) or remote (using MySQL/PostgreSQL for 
    production).

    HiveQL (Hive Query Language)

    HiveQL is a query language similar to SQL used for querying and managing large 
    datasets in Hive. It allows users to write queries to create tables, load data, and 
    perform analysis using simple syntax.

    Examples of what you can do with HiveQL:
    • Create tables
    • Load data into tables
    • Query data using SELECT
    • Perform joins, filtering, grouping, and aggregations
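
    A small sketch of these steps (table, column, and file names are made up):

        CREATE TABLE employees (
            id     INT,
            name   STRING,
            dept   STRING,
            salary DOUBLE
        )
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        STORED AS TEXTFILE;

        LOAD DATA LOCAL INPATH '/tmp/employees.csv' INTO TABLE employees;

        SELECT dept, AVG(salary) AS avg_salary
        FROM employees
        GROUP BY dept;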

    Tables in Hive

    Hive supports two types of tables:

    1. Managed Tables: Hive controls both the metadata and the data. If you drop the 
    table, data is also deleted.

    2. External Tables: Only metadata is managed by Hive. The data stays in HDFS even if the table is dropped.

    Tables have a schema (columns and data types) and can be partitioned (organized by specific columns for faster queries).
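
    A sketch of both kinds of table, including a partition column (all names are illustrative):

        -- managed table: dropping it deletes the metadata and the data
        CREATE TABLE sales_managed (id INT, amount DOUBLE)
        PARTITIONED BY (sale_year INT)
        STORED AS TEXTFILE;

        -- external table: dropping it removes only the metadata; the files stay in HDFS
        CREATE EXTERNAL TABLE sales_external (id INT, amount DOUBLE)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
        LOCATION '/data/sales/';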

    Querying Data in Hive
    You can query data using HiveQL. You can:
    • Use SELECT to retrieve specific columns
    • Use WHERE to filter records
    • Use GROUP BY to aggregate data
    • Use JOIN to combine tables
    Hive supports basic querying operations similar to SQL but is designed for batch 
    processing, not real-time.
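
    A single query can combine these clauses (the employees and departments tables here are
    hypothetical):

        SELECT e.dept, COUNT(*) AS num_employees
        FROM employees e
        JOIN departments d ON e.dept = d.dept_name
        WHERE e.salary > 50000
        GROUP BY e.dept;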

    User Defined Functions (UDFs)
    Hive provides built-in functions for operations like string manipulation, math, date handling, etc.
    If you need a function that is not available, you can create your own UDF. These are 
    custom functions that users write (usually in Java) and then register in Hive to use in 
    queries.

    Example use cases for UDFs:
    • Custom data transformations
    • Special filtering conditions
    • Advanced calculations
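
    A minimal sketch of a Hive UDF in Java using the classic UDF API (the class, jar, and function
    names are assumptions):

        // ToUpperUDF.java
        import org.apache.hadoop.hive.ql.exec.UDF;
        import org.apache.hadoop.io.Text;

        public class ToUpperUDF extends UDF {
            public Text evaluate(Text input) {
                if (input == null) {
                    return null;
                }
                return new Text(input.toString().toUpperCase());
            }
        }

    After packaging it into a jar, it is registered and used from HiveQL:

        ADD JAR /tmp/toupper-udf.jar;
        CREATE TEMPORARY FUNCTION to_upper AS 'ToUpperUDF';
        SELECT to_upper(name) FROM employees;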

    Sorting and Aggregating Data in Hive

    Sorting: Hive supports total sorting with the ORDER BY clause. Because ORDER BY sends all
    data through a single reducer, it is slow for very large datasets.
    Distributed Sorting: Use SORT BY (sorts within each reducer), DISTRIBUTE BY (controls which
    reducer receives which rows), or CLUSTER BY (DISTRIBUTE BY plus SORT BY on the same columns).
    Aggregating: Hive supports aggregation using functions like:
    • COUNT() – Counts rows
    • SUM() – Adds values
    • AVG() – Averages values
    • MAX() / MIN() – Gets maximum or minimum values
    These functions are often used with GROUP BY to get results grouped by a column.
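
    For example, using the hypothetical employees table from earlier:

        -- total ordering: a single reducer, slow for very large data
        SELECT name, salary FROM employees ORDER BY salary DESC;

        -- distributed sorting: rows for the same dept go to the same reducer and are sorted within it
        SELECT name, dept, salary FROM employees
        DISTRIBUTE BY dept SORT BY salary DESC;

        -- aggregation grouped by a column
        SELECT dept, COUNT(*) AS cnt, SUM(salary) AS total_salary,
               AVG(salary) AS avg_salary, MAX(salary) AS max_salary, MIN(salary) AS min_salary
        FROM employees
        GROUP BY dept;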

    MapReduce Scripts in Hive
    • Hive automatically converts your HiveQL queries into MapReduce jobs.
    • You don’t need to write MapReduce code manually to process data.
    • Behind the scenes, when you run a query like SELECT, Hive translates it into a 
    series of MapReduce steps to execute the task in parallel.
    • For advanced processing, Hive allows custom scripts (written in Python, Java, etc.) to be
    plugged into a query using the TRANSFORM clause in HiveQL.
    • This feature is helpful when default HiveQL is not enough, and you need 
    specific processing logic.
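
    A sketch of the TRANSFORM clause (the script name, its language, and the table are assumptions;
    the script is expected to read tab-separated rows from standard input and write rows back):

        ADD FILE /tmp/parse_log.py;

        SELECT TRANSFORM (ip, request)
               USING 'python parse_log.py'
               AS (ip STRING, page STRING)
        FROM web_logs;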

    Joins in Hive
    Joins in Hive are used to combine rows from two or more tables based on a 
    related column.
    Common types of joins in Hive:

    1. INNER JOIN: Returns rows that match in both tables.
    2. LEFT OUTER JOIN: Returns all rows from the left table and matching rows from 
    the right.
    3. RIGHT OUTER JOIN: Returns all rows from the right table and matching rows 
    from the left.
    4. FULL OUTER JOIN: Returns all rows from both tables, with NULLs filled in wherever a row
    has no match in the other table.
    5. MapJoin: A special join where the smaller table is loaded into memory to 
    speed up the join process. Useful when one table is small.

    Hive joins are similar to SQL joins but work on large-scale datasets using MapReduce.
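
    Sketches of the common forms (the orders and customers tables are hypothetical):

        -- inner join: only matching rows
        SELECT o.order_id, c.name
        FROM orders o
        JOIN customers c ON o.customer_id = c.id;

        -- left outer join: every order, with NULL customer columns where there is no match
        SELECT o.order_id, c.name
        FROM orders o
        LEFT OUTER JOIN customers c ON o.customer_id = c.id;

        -- map join hint: ask Hive to load the small table (customers) into memory
        SELECT /*+ MAPJOIN(c) */ o.order_id, c.name
        FROM orders o
        JOIN customers c ON o.customer_id = c.id;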

    Subqueries in Hive
    Subqueries are queries nested inside another query.
    Types of subqueries in Hive:
    • Scalar Subquery: Returns a single value; in Hive it is mainly usable in the WHERE clause.
    • IN/NOT IN Subqueries: Used to check if a value exists in the result of another 
    query.
    • EXISTS Subquery: Checks if a subquery returns any rows.
    • Derived Tables (Inline Views): A subquery used in the FROM clause. Acts like a 
    temporary table.
    Hive supports limited subquery usage compared to standard SQL, but common forms such as
    subqueries in the FROM and WHERE clauses are available.
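
    Two commonly used forms, again with hypothetical tables:

        -- IN subquery in the WHERE clause
        SELECT name FROM employees
        WHERE dept IN (SELECT dept_name FROM departments WHERE location = 'Delhi');

        -- derived table (inline view) in the FROM clause
        SELECT t.dept, t.avg_salary
        FROM (SELECT dept, AVG(salary) AS avg_salary
              FROM employees
              GROUP BY dept) t
        WHERE t.avg_salary > 60000;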

    HBase Concepts:

    HBase is a NoSQL, distributed, and scalable database built on top of Hadoop HDFS (Hadoop Distributed File System).
    • It stores data grouped into column families rather than as fixed relational rows, which makes it efficient for sparse data and for reading or writing only the columns that are needed.
    Data Model: HBase stores data in tables. Each table holds rows identified by unique row keys, and the data within each row is organized into column families that contain individual columns.
    • It is optimized for random, real-time read/write operations on large datasets.

    HBase Clients:

    Java API: HBase provides a native Java API for interacting with HBase, which is the most commonly used.
    REST API: A RESTful interface is available for interacting with HBase using HTTP requests.
    Thrift API: A language-agnostic API that allows applications in multiple languages (like Python, C++, etc.) to interact with HBase.
    SQL/JDBC Access: HBase itself is not SQL-based, but layers such as Apache Phoenix provide a JDBC driver over HBase for easier integration with SQL-based applications.
    MapReduce Integration: HBase integrates seamlessly with Hadoop’s MapReduce framework for processing large datasets.

    Example:
    • HBase tables are structured with row keys, column families, and columns. For 
    example, a table might represent information about students where each 
    student’s row key is their ID, and the columns might include "name", 
    "address", and "marks".
    • For a table called student_data, a row for one student might look like:
    Row Key: 123
    Column Family: personal -> Name: John Doe
    Column Family: academic -> Marks: 90
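
    A minimal sketch of the Java client API for this student_data table (connection settings and
    values are illustrative, and the table with its personal and academic column families is
    assumed to already exist):

        // StudentClient.java
        import java.io.IOException;
        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.hbase.HBaseConfiguration;
        import org.apache.hadoop.hbase.TableName;
        import org.apache.hadoop.hbase.client.Connection;
        import org.apache.hadoop.hbase.client.ConnectionFactory;
        import org.apache.hadoop.hbase.client.Get;
        import org.apache.hadoop.hbase.client.Put;
        import org.apache.hadoop.hbase.client.Result;
        import org.apache.hadoop.hbase.client.Table;
        import org.apache.hadoop.hbase.util.Bytes;

        public class StudentClient {
            public static void main(String[] args) throws IOException {
                Configuration conf = HBaseConfiguration.create();
                try (Connection connection = ConnectionFactory.createConnection(conf);
                     Table table = connection.getTable(TableName.valueOf("student_data"))) {

                    // write: row key "123", column families "personal" and "academic"
                    Put put = new Put(Bytes.toBytes("123"));
                    put.addColumn(Bytes.toBytes("personal"), Bytes.toBytes("name"), Bytes.toBytes("John Doe"));
                    put.addColumn(Bytes.toBytes("academic"), Bytes.toBytes("marks"), Bytes.toBytes("90"));
                    table.put(put);

                    // read the row back by its key
                    Get get = new Get(Bytes.toBytes("123"));
                    Result result = table.get(get);
                    String name = Bytes.toString(result.getValue(Bytes.toBytes("personal"), Bytes.toBytes("name")));
                    System.out.println("Name: " + name);
                }
            }
        }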

    Features of HBase

    1. Distributed and Scalable: HBase is a distributed NoSQL database designed to handle large volumes of data 
    across many machines. It is horizontally scalable, meaning you can add more nodes to the cluster to increase capacity and throughput.

    2. Real-Time Data Access: HBase provides real-time read and write access to data. It is designed for low-latency access, making it suitable for real-time applications such as online analytics, recommendation engines, and logging systems.

    3. Column-Oriented Storage: Unlike traditional relational databases that store data in rows, HBase stores data in 
    column families. This makes it more efficient for reading and writing large amounts of data by accessing only the required columns, reducing I/O operations.

    4. Fault Tolerant: HBase is built on top of Hadoop’s HDFS, which provides fault tolerance through data replication. If a node fails, data is still accessible from other nodes that have replicated copies of the data.

    5. Automatic Sharding: HBase automatically splits tables into regions and distributes them across the cluster. 
    This automatic sharding allows for scalable storage and processing of large datasets without the need for manual partitioning.

    6. Flexible Schema: HBase provides a flexible schema where columns can be added to a column family at any time, and the schema can evolve as the application grows, making it adaptable to changing requirements.

    7. Strong Consistency: HBase provides strong consistency guarantees within a region. When a write is acknowledged, it is immediately available for reads from any client that requests the data.

    8. Integration with Hadoop Ecosystem: HBase integrates seamlessly with Hadoop, MapReduce, and other Hadoop-based tools, enabling big data processing, analytics, and batch jobs to be run efficiently on the same 
    data stored in HBase.

    Advanced Usage of HBase:
    Data Locality: HBase ensures that data is stored in a way that it can be processed by the local node, reducing network overhead.
    MapReduce Integration: You can use MapReduce jobs to process data stored in HBase, making it suitable for big data processing and analysis.
    Bulk Load: HBase supports bulk loading of data from HDFS into HBase, which is efficient for loading large datasets into HBase tables.
    Real-time Analytics: HBase is commonly used for real-time data analytics due to its ability to support random read/write operations.

    Schema Design in HBase:
    Column Families: Choose the number of column families wisely; each column family is stored in its own set of files within a region, so having many families adds flush and compaction overhead.
    Row Keys: Design your row keys carefully to avoid hot spotting. Row keys should be unique and evenly distributed.
    Avoid Wide Rows: Avoid designs that pack a very large number of columns or versions into a single row, because very wide rows can cause performance issues.
    Data Model Flexibility: HBase allows for flexible schema design. You can add columns to column families without affecting existing data, which is a significant advantage for rapidly changing data models.

    What is Zookeeper?

    Zookeeper is a centralized service used in big data systems (like Hadoop and 
    HBase) to manage and coordinate distributed systems. It acts like a manager 
    that keeps all the nodes (computers) in a cluster connected, informed, and 
    synchronized.

    How Zookeeper Helps in Monitoring a Cluster:

    Manages Nodes: It keeps track of which machines (nodes) are working and 
    which are not.
    Failure Detection: If a node fails or disconnects, Zookeeper quickly detects it 
    and informs the system.
    Leader Election: In systems where one node needs to act as a leader (like 
    NameNode in Hadoop), Zookeeper helps choose one automatically.
    Keeps Configuration Info: It stores settings and configuration that all machines 
    in the cluster can access.

    How to Build Applications with Zookeeper:

    Use of znodes: Zookeeper stores all information in a hierarchical namespace (like a file system)
    made up of nodes called znodes. Applications can read and write data to these znodes.
    Watches and Notifications: Applications can set watches on znodes to get 
    notifications when data changes. This is useful for real-time configuration 
    updates.
    Locking and Synchronization: Zookeeper allows distributed locking, which helps in resource sharing and synchronizing actions across distributed applications.
    Group Membership: Applications can use Zookeeper to track group membership, i.e., keeping a record of which services are online and available.
    Naming Service: Zookeeper can be used to manage names of services and provide a lookup mechanism, like a phone directory for distributed services.
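
    A small sketch using the ZooKeeper Java client that touches several of these ideas: creating
    znodes, setting a watch, and using an ephemeral znode for group membership (the connection
    string, paths, and data are assumptions):

        // ConfigWatcher.java
        import org.apache.zookeeper.*;
        import org.apache.zookeeper.data.Stat;

        public class ConfigWatcher {
            public static void main(String[] args) throws Exception {
                // connect and register a default watcher that prints every event it receives
                ZooKeeper zk = new ZooKeeper("localhost:2181", 5000,
                        event -> System.out.println("Event: " + event.getType() + " on " + event.getPath()));

                // create a persistent znode holding configuration data (if it does not exist yet)
                if (zk.exists("/app", false) == null) {
                    zk.create("/app", "batch.size=100".getBytes(),
                              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
                }

                // read the znode and set a watch; the watcher above fires when its data changes
                Stat stat = new Stat();
                byte[] data = zk.getData("/app", true, stat);
                System.out.println("Config: " + new String(data) + " (version " + stat.getVersion() + ")");

                // an ephemeral znode disappears automatically when this client's session ends;
                // this is the basis for group membership and failure detection
                zk.create("/app/worker-1", new byte[0],
                          ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

                Thread.sleep(10000);   // keep the session alive briefly so watches can fire
                zk.close();
            }
        }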

    IBM Big Data Strategy:

    IBM's Big Data strategy focuses on helping organizations manage, analyze, and use 
    large volumes of data efficiently. The key elements of the strategy include:
    • Volume, Variety, Velocity, and Veracity (4 V’s) of big data.
    • Unified platform to integrate data from different sources.
    • Real-time analytics and insights.
    • Secure and scalable systems to handle enterprise data.
    • Integration of AI and machine learning for smarter data processing.

    IBM Infosphere:

    IBM InfoSphere is a data integration platform that provides tools to collect, clean, 
    manage, and govern data. It helps in:
    • Data warehousing and data quality management.
    • Connecting structured and unstructured data.
    • Ensuring data security, compliance, and integration across platforms.
    • Supporting ETL (Extract, Transform, Load) processes.

    IBM BigInsights:

    IBM BigInsights is IBM’s Big Data platform built on Apache Hadoop. It is designed 
    for large-scale data analysis and includes:
    • A user-friendly interface for non-technical users.
    • Hadoop-based architecture for distributed data processing.
    • Tools for data mining, machine learning, and analytics.
    • Integration with IBM tools like Infosphere and Big SQL.

    IBM Big Sheets:

    Big Sheets is a spreadsheet-style web interface that allows users to work with 
    large datasets easily. It is used in BigInsights and is suitable for:
    • Users who are not familiar with programming.
    • Analyzing large data sets using a visual interface.
    • Performing tasks like sorting, filtering, and charting big data.

    Introduction to Big SQL:

    Big SQL is an IBM technology that allows SQL queries on big data stored in 
    Hadoop. It helps in:
    • Running SQL queries on Hive tables, HBase, and other data sources.
    • Using existing SQL knowledge to analyze big data.
    • Providing high performance, security, and compatibility with traditional 
    databases.
    • Allowing integration with BI tools like IBM Cognos and others.
