UNIT 3 | Data Warehousing and Data Mining Notes | AKTU Notes



    1. Overview of Data Mining

    Data mining is the process of discovering useful patterns, knowledge, and relationships from large datasets using techniques from statistics, machine learning, and database systems.

    It helps organizations analyze large volumes of data and extract meaningful information for decision making.

    Explanation: Data mining means extracting useful information and patterns from large amounts of data. Example: analyzing customer buying behavior.


    2. Motivation of Data Mining

    The motivation for data mining comes from the rapid growth of data in modern organizations. Traditional database systems are designed for storing and querying data, so on their own they cannot efficiently analyze large datasets for patterns.

    Main reasons for using data mining:

    • Huge amount of stored data
    • Need for better decision making
    • Business intelligence and prediction
    • Discover hidden patterns

    Explanation: Companies hold very large amounts of data, and data mining is used to extract useful knowledge from that data.


    3. Definition and Functionalities of Data Mining

    Definition: Data mining is the process of extracting hidden patterns, relationships, and useful information from large databases.

    Main Functionalities:

    • Classification
    • Clustering
    • Association rule discovery
    • Prediction
    • Outlier detection
    • Data summarization

    Explanation: Through data mining, we analyze data to find patterns and relationships.


    4. Data Processing

    Data processing is the step where raw data is converted into a clean, organized form before data mining algorithms are applied.

    Steps include:

    • Data collection
    • Data cleaning
    • Data integration
    • Data transformation
    • Data reduction

    Explanation: In data processing, raw data is cleaned and organized so that it is ready for analysis.


    5. Forms of Data Preprocessing

    Data preprocessing is the process of preparing data before applying mining techniques.

    Main forms:

    • Data cleaning
    • Data integration
    • Data transformation
    • Data reduction

    Preprocessing improves the accuracy and efficiency of data mining results.


    6. Data Cleaning

    Data cleaning is the process of correcting errors, filling in missing values, and resolving inconsistencies in the dataset.

    Types of data problems:

    • Missing values
    • Noisy data
    • Inconsistent data

    7. Handling Missing Values

    Missing values occur when some data fields are empty.

    Methods to handle missing values:

    • Ignore the tuple
    • Fill with mean or median value
    • Use regression methods
    • Use machine learning prediction
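
    The mean and median fill methods above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the function name `fill_missing`, the use of `None` to mark a missing value, and the sample numbers are all assumptions made for this example.

```python
from statistics import mean, median

def fill_missing(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("no observed values to compute a fill value from")
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]

# Hypothetical attribute with one missing entry.
incomes = [10, None, 20, 90]
print(fill_missing(incomes))            # the gap is filled with the mean, 40
print(fill_missing(incomes, "median"))  # the gap is filled with the median, 20
```

    The median is less affected by extreme values such as 90, which is why it is often preferred for skewed attributes.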

    8. Noisy Data

    Noisy data refers to random errors or incorrect values present in the dataset.

    Techniques to handle noisy data:

    • Binning
    • Clustering
    • Regression
    • Computer and human inspection

    9. Binning

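    Binning is a technique where sorted data values are divided into groups called bins, and each value is then smoothed using the other values in its bin to reduce noise.

    Smoothing methods:

    • Smoothing by bin means: each value is replaced by the mean of its bin
    • Smoothing by bin medians: each value is replaced by the median of its bin
    • Smoothing by bin boundaries: each value is replaced by the closer of the bin's minimum and maximum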
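
    A small sketch of binning with smoothing by bin means and by bin boundaries is given below. It assumes equal-frequency bins (the same number of values in each bin); the function names and the sample prices are illustrative choices, not part of the notes.

```python
def equal_frequency_bins(values, bin_size):
    """Sort the values and cut them into consecutive bins of bin_size items."""
    ordered = sorted(values)
    return [ordered[i:i + bin_size] for i in range(0, len(ordered), bin_size)]

def smooth_by_means(bins):
    """Replace every value in a bin with that bin's mean."""
    return [[sum(b) / len(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of its bin's minimum and maximum."""
    smoothed = []
    for b in bins:
        low, high = b[0], b[-1]  # bins are sorted, so these are min and max
        smoothed.append([low if v - low <= high - v else high for v in b])
    return smoothed

prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
bins = equal_frequency_bins(prices, 3)
print(bins)                        # [[4, 8, 15], [21, 21, 24], [25, 28, 34]]
print(smooth_by_means(bins))       # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(smooth_by_boundaries(bins))  # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```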

    10. Clustering

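    Clustering is the process of grouping similar objects into clusters so that objects in the same cluster are more similar to each other than to objects in other clusters. When used for noise handling, values that fall outside every cluster can be treated as outliers.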

    Example: grouping customers based on buying behavior.
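
    One way to picture the customer example is a tiny one-dimensional k-means, sketched below. The notes do not name a specific clustering algorithm, so k-means is an assumption here, as are the function name `kmeans_1d`, the starting centers, and the spend figures.

```python
def kmeans_1d(points, centers, iterations=10):
    """Tiny 1-D k-means: assign each point to its nearest center, then
    move each center to the mean of the points assigned to it."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Keep the old center if a cluster ends up empty.
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical monthly spend per customer.
spend = [10, 12, 11, 95, 100, 105]
centers, clusters = kmeans_1d(spend, centers=[0, 50])
print(clusters)  # low spenders and high spenders end up in separate groups
```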


    11. Regression

    Regression is a statistical technique used to model relationships between variables and predict future values.

    Example: predicting sales based on past data.
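
    The sales example can be sketched with simple linear regression fitted by least squares. The notes do not specify the type of regression, so a straight-line model is assumed; `fit_line` and the monthly figures are made up for illustration.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical past data: month number and sales in that month.
months = [1, 2, 3, 4]
sales = [10, 20, 30, 40]
slope, intercept = fit_line(months, sales)
print(slope * 5 + intercept)  # predicted sales for month 5
```

    For noise handling, the same fitted line can be used to smooth data by replacing each observed value with the value the line predicts.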


    12. Computer and Human Inspection

    In this method, both automated algorithms and human experts analyze data to identify errors and inconsistencies.


    13. Inconsistent Data

    Inconsistent data occurs when data values contradict each other or violate integrity constraints.

    Example: different addresses for the same customer.
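
    A simple check for this kind of inconsistency is to group records by customer and flag any customer with more than one distinct address. The sketch below assumes records are `(customer_id, address)` pairs; the function name and sample records are hypothetical.

```python
from collections import defaultdict

def find_conflicts(records):
    """Return customers that appear with more than one distinct address."""
    addresses = defaultdict(set)
    for customer_id, address in records:
        addresses[customer_id].add(address)
    return {cid: sorted(addrs)
            for cid, addrs in addresses.items() if len(addrs) > 1}

records = [
    ("C1", "12 Park Road"),
    ("C2", "5 Lake View"),
    ("C1", "40 Hill Street"),  # contradicts the first record for C1
]
print(find_conflicts(records))  # {'C1': ['12 Park Road', '40 Hill Street']}
```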


    14. Data Integration

    Data integration is the process of combining data from multiple sources into a single dataset.

    Example: integrating sales data from multiple branch databases.
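
    As a rough sketch of the branch example, the function below merges per-branch sales rows into one dataset and tags each row with its source branch. It assumes all branches already share the same field names; the branch names and rows are invented for the example.

```python
def integrate(branches):
    """Combine rows from several branch datasets into one list,
    recording which branch each row came from."""
    combined = []
    for branch_name, rows in branches.items():
        for row in rows:
            combined.append({"branch": branch_name, **row})
    return combined

branches = {
    "delhi": [{"item": "pen", "qty": 10}],
    "lucknow": [{"item": "pen", "qty": 4}, {"item": "book", "qty": 2}],
}
for row in integrate(branches):
    print(row)
```

    In practice, integration also has to reconcile differing field names, units, and duplicate records across sources, which this sketch leaves out.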


    15. Data Transformation

    Data transformation converts data into appropriate formats for mining.

    Common techniques:

    • Normalization
    • Aggregation
    • Generalization
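
    Normalization is the easiest of these to show in code. The sketch below uses min-max normalization, one common choice, to rescale values into the range 0 to 1; the function name, default range, and sample values are assumptions for this example.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values so the smallest maps to new_min
    and the largest maps to new_max."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [new_min for _ in values]
    return [new_min + (v - lo) / (hi - lo) * (new_max - new_min)
            for v in values]

ages = [20, 30, 40, 60]
print(min_max_normalize(ages))  # [0.0, 0.25, 0.5, 1.0]
```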

    16. Data Reduction

    Data reduction reduces the size of the dataset while preserving the important information.

    Techniques:

    • Data cube aggregation
    • Dimensionality reduction
    • Data compression
    • Numerosity reduction

    17. Data Cube Aggregation

    Data cube aggregation reduces data by summarizing values along dimensions.

    Example: summarizing daily sales into monthly sales.
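
    The daily-to-monthly example amounts to rolling up the time dimension. A minimal sketch, assuming each record is a `(date, amount)` pair with dates written as `YYYY-MM-DD` strings (both the format and the figures are assumptions):

```python
from collections import defaultdict

def monthly_totals(daily_sales):
    """Aggregate (date, amount) records into one total per month."""
    totals = defaultdict(int)
    for date, amount in daily_sales:
        month = date[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[month] += amount
    return dict(totals)

daily = [("2024-01-05", 100), ("2024-01-20", 150), ("2024-02-03", 80)]
print(monthly_totals(daily))  # {'2024-01': 250, '2024-02': 80}
```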


    18. Dimensionality Reduction

    Dimensionality reduction reduces the number of attributes in the dataset while preserving useful information.

    Example: feature selection methods.
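
    One very simple feature selection filter drops attributes whose values barely vary, since they carry little information for mining. The sketch below assumes numeric attributes stored as dictionaries; the threshold, function name, and sample rows are illustrative only.

```python
from statistics import pvariance

def select_features(rows, threshold=0.0):
    """Keep only attributes whose variance exceeds the threshold."""
    names = list(rows[0].keys())
    keep = [n for n in names if pvariance([r[n] for r in rows]) > threshold]
    return [{n: r[n] for n in keep} for r in rows]

rows = [
    {"age": 25, "income": 30000, "country_code": 1},
    {"age": 40, "income": 52000, "country_code": 1},
    {"age": 31, "income": 41000, "country_code": 1},
]
# country_code is constant, so it is dropped.
print(select_features(rows)[0])  # {'age': 25, 'income': 30000}
```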


    19. Data Compression

    Data compression reduces the storage space required for data while maintaining essential information.
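
    As an illustration of lossless compression, the sketch below uses Python's standard `zlib` module to shrink repetitive sales rows and then restore them exactly. The row format and the helper name `compress_rows` are assumptions; lossy techniques, which keep only an approximation of the data, are not shown.

```python
import zlib

def compress_rows(rows):
    """Join text rows and compress them; returns (raw_bytes, compressed_bytes)."""
    raw = "\n".join(rows).encode("utf-8")
    return raw, zlib.compress(raw)

# Hypothetical, highly repetitive data compresses very well.
rows = ["2024-01-01,Lucknow,100"] * 200
raw, packed = compress_rows(rows)
print(len(raw), len(packed))

# Lossless: decompressing gives back exactly the original bytes.
restored = zlib.decompress(packed)
print(restored == raw)  # True
```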


    20. Numerosity Reduction

    Numerosity reduction replaces large datasets with smaller representations such as models or summaries.
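
    A histogram is one such summary: instead of storing every value, only the count per bucket is kept. The sketch below assumes equal-width buckets over non-negative integers; the bucket width and the price list are made up for the example.

```python
from collections import Counter

def equal_width_histogram(values, width):
    """Summarize values as counts per bucket; each key is a bucket's lower bound."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))

prices = [1, 1, 5, 5, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25]
print(equal_width_histogram(prices, 10))  # {0: 6, 10: 5, 20: 3}
```

    Fourteen values are replaced by three bucket counts, at the cost of losing the exact values inside each bucket.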


    21. Discretization

    Discretization converts continuous data values into discrete intervals.

    Example: converting age values into age groups.
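
    The age example can be sketched with a list of cut points and one label per interval. The cut points (13, 20, 60) and the group names below are arbitrary choices for illustration, not standard definitions.

```python
from bisect import bisect_right

# Hypothetical interval boundaries: <13, 13-19, 20-59, 60+
CUT_POINTS = [13, 20, 60]
LABELS = ["child", "teen", "adult", "senior"]

def discretize(age):
    """Map a continuous age to the label of the interval it falls in."""
    return LABELS[bisect_right(CUT_POINTS, age)]

ages = [12, 13, 19, 20, 59, 60]
print([discretize(a) for a in ages])
# ['child', 'teen', 'teen', 'adult', 'adult', 'senior']
```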


    22. Concept Hierarchy Generation

    Concept hierarchy organizes data into different levels of abstraction.

    Example: City → State → Country hierarchy.
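
    The hierarchy above can be represented as two lookup tables, one per step up in abstraction. This is only a sketch: the table names, the `generalize` function, and the handful of cities listed are chosen for the example.

```python
CITY_TO_STATE = {
    "Lucknow": "Uttar Pradesh",
    "Kanpur": "Uttar Pradesh",
    "Mumbai": "Maharashtra",
}
STATE_TO_COUNTRY = {"Uttar Pradesh": "India", "Maharashtra": "India"}

def generalize(city, level):
    """Climb the City -> State -> Country hierarchy to the requested level."""
    if level == "city":
        return city
    state = CITY_TO_STATE[city]
    if level == "state":
        return state
    if level == "country":
        return STATE_TO_COUNTRY[state]
    raise ValueError(f"unknown level: {level}")

print(generalize("Kanpur", "state"))    # Uttar Pradesh
print(generalize("Kanpur", "country"))  # India
```

    Replacing city values with state or country values in this way is what generalization means in data transformation.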


    23. Decision Tree

    A decision tree is a classification technique used in data mining that represents decisions and possible outcomes in a tree-like structure.

    Main components:

    • Root node
    • Decision nodes
    • Leaf nodes

    Example: Predicting whether a customer will buy a product based on age and income.
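
    The sketch below writes a small decision tree by hand to show how the root node, decision nodes, and leaf nodes fit together. The thresholds (age 30, income 50000) are invented for illustration; a real tree would be learned from training data by an algorithm such as ID3 or C4.5.

```python
# Root node tests age; one decision node tests income; leaves hold labels.
TREE = {
    "attribute": "age",
    "threshold": 30,
    "below": {"label": "no"},          # leaf node
    "at_or_above": {                   # decision node
        "attribute": "income",
        "threshold": 50000,
        "below": {"label": "no"},      # leaf node
        "at_or_above": {"label": "yes"},  # leaf node
    },
}

def classify(node, customer):
    """Walk from the root to a leaf, following one test per node."""
    while "label" not in node:
        value = customer[node["attribute"]]
        branch = "below" if value < node["threshold"] else "at_or_above"
        node = node[branch]
    return node["label"]

print(classify(TREE, {"age": 45, "income": 60000}))  # yes
print(classify(TREE, {"age": 22, "income": 80000}))  # no
```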


    Conclusion

    Data mining helps organizations extract valuable knowledge from large datasets. Techniques such as data preprocessing, data reduction, clustering, regression, and decision trees help improve data analysis and support better decision making.
