UNIT 4 | Data Warehousing and Data Mining Notes | AKTU Notes



    1. Classification

    Classification is a data mining technique that assigns data items to predefined categories (classes) based on their attributes. A model is trained on labeled data (the training dataset) and is then used to classify new, unseen data.

    Example: Email filtering system – Spam and Not Spam.

    In simple terms: classification means sorting data into predefined categories.


    2. Data Generalization

    Data generalization is the process of summarizing detailed data into higher-level concepts.

    Example: City → State → Country

    In simple terms: data generalization replaces detailed values (such as a city) with higher-level concepts (such as its state or country).
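
    The City → State → Country roll-up can be sketched as a lookup through a concept hierarchy. This is only a minimal illustration: the two mapping tables, the city names, and the function name generalize are assumptions made for the example, not part of the notes.

```python
from collections import Counter

# Hypothetical concept hierarchy: city -> state -> country.
CITY_TO_STATE = {
    "Lucknow": "Uttar Pradesh",
    "Kanpur": "Uttar Pradesh",
    "Pune": "Maharashtra",
}
STATE_TO_COUNTRY = {"Uttar Pradesh": "India", "Maharashtra": "India"}


def generalize(cities, level):
    """Replace each city by its state or country and count the records."""
    if level == "state":
        values = [CITY_TO_STATE[c] for c in cities]
    elif level == "country":
        values = [STATE_TO_COUNTRY[CITY_TO_STATE[c]] for c in cities]
    else:
        raise ValueError("level must be 'state' or 'country'")
    return Counter(values)


# Four city-level records (made-up data).
sales_cities = ["Lucknow", "Kanpur", "Pune", "Lucknow"]
by_state = generalize(sales_cities, "state")
by_country = generalize(sales_cities, "country")
print(by_state)
print(by_country)
```

    Each step up the hierarchy gives fewer, coarser groups, which is what makes the summary easier to read.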


    3. Analytical Characterization

    Analytical characterization describes the general characteristics of a target class of data. It summarizes the important features of the data.

    Example: Analyzing characteristics of high-profit customers.


    4. Analysis of Attribute Relevance

    Attribute relevance analysis identifies which attributes are most important for classification or prediction, so that irrelevant or redundant attributes can be left out.

    Example attributes to evaluate:

    • Age
    • Income
    • Location
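
    One common way to score relevance is information gain: how much knowing an attribute reduces uncertainty (entropy) about the class. The sketch below assumes categorical attributes and a tiny made-up customer table; the notes do not name a specific measure, so treat information gain as one possible choice.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target minus the weighted entropy after splitting on attribute."""
    base = entropy([r[target] for r in rows])
    groups = {}
    for r in rows:
        groups.setdefault(r[attribute], []).append(r[target])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return base - remainder


# Hypothetical toy data: income decides the class, location does not.
customers = [
    {"income": "high", "location": "north", "buys": "yes"},
    {"income": "high", "location": "south", "buys": "yes"},
    {"income": "low", "location": "north", "buys": "no"},
    {"income": "low", "location": "south", "buys": "no"},
]
gain_income = information_gain(customers, "income", "buys")
gain_location = information_gain(customers, "location", "buys")
print(gain_income, gain_location)
```

    In this toy table income has the highest possible gain for a two-class problem and location has none, so location would be a candidate for removal.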

    5. Mining Class Comparisons

    Class comparison compares different classes of data to identify differences between them.

    Example: Comparing customers who buy a product and customers who do not buy a product.


    6. Statistical Measures in Large Databases

    Statistical measures help analyze large datasets.

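    • Mean: Average value of the data (sum of values divided by their count)
    • Median: Middle value when the data is sorted
    • Mode: Most frequent value
    • Variance: Average squared deviation from the mean; measures data spread
    • Standard Deviation: Square root of the variance; shows the typical variation from the mean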
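
    The measures above can be computed with Python's standard statistics module. The sample values are made up, and the sketch uses the population forms of variance and standard deviation (dividing by n); the sample forms divide by n - 1 and are available as statistics.variance and statistics.stdev.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data

mean = statistics.mean(values)           # 40 / 8 = 5
median = statistics.median(values)       # average of the two middle values, 4 and 5
mode = statistics.mode(values)           # 4 occurs three times
variance = statistics.pvariance(values)  # population variance: 32 / 8 = 4
std_dev = statistics.pstdev(values)      # square root of the variance = 2

print(mean, median, mode, variance, std_dev)
```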

    7. Statistical-Based Algorithms

    These algorithms use statistical models to classify data.

    • Naive Bayes Classifier
    • Logistic Regression
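
    As an illustration of the first algorithm listed, here is a minimal Naive Bayes sketch for the spam example. It assumes messages are already split into words and uses add-one (Laplace) smoothing; the training messages, the function names, and the smoothing choice are assumptions for the example, not a reference implementation.

```python
from collections import Counter, defaultdict
from math import log


def train_naive_bayes(messages, labels):
    """Count messages per class and word occurrences per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for words, label in zip(messages, labels):
        word_counts[label].update(words)
        vocabulary.update(words)
    return class_counts, word_counts, vocabulary


def predict(model, words):
    """Return the class with the highest log-probability for the given words."""
    class_counts, word_counts, vocabulary = model
    total_messages = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, count in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = log(count / total_messages)  # log prior P(class)
        for w in words:
            # Add-one smoothing keeps unseen words from giving probability zero.
            score += log((word_counts[label][w] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up training data.
messages = [
    ["win", "money", "now"],
    ["free", "money"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "not spam", "not spam"]
model = train_naive_bayes(messages, labels)
print(predict(model, ["free", "money"]))
print(predict(model, ["meeting", "notes"]))
```

    The "naive" part is the assumption that words are independent given the class, which is why the per-word log-probabilities are simply added.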

    8. Distance-Based Algorithms

    Distance-based algorithms classify data based on distance between data points.

    • Euclidean Distance
    • Manhattan Distance

    Example Algorithm: K-Nearest Neighbor (KNN)
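
    The two distance measures and a basic KNN classifier can be sketched as follows. The training points, the labels "A" and "B", and the default k = 3 are assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def euclidean(a, b):
    """Straight-line distance between two points."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def knn_predict(training, query, k=3, distance=euclidean):
    """Label the query with the majority class among its k nearest training points."""
    neighbours = sorted(training, key=lambda item: distance(item[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Made-up labeled points: (coordinates, class).
training = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
print(knn_predict(training, (2, 2)))
print(knn_predict(training, (6, 5), distance=manhattan))
```

    KNN has no training phase: all the work happens at query time, which is why it can be slow on large datasets.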


    9. Decision Tree-Based Algorithms

    A decision tree is a tree structure used for classification and prediction.

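    • Root Node: Topmost node, where the first attribute test is made
    • Decision Node: Internal node that tests an attribute and branches on the result
    • Leaf Node: Terminal node that holds the predicted class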

    Algorithms: ID3, C4.5, CART
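
    To show how the three node types work together, the sketch below classifies a record by walking a small hand-written tree. The tree, its attributes, and the class names are made up; a real tree would be learned from data by an algorithm such as ID3, C4.5, or CART, which this example does not implement.

```python
# Hypothetical hand-built tree stored as nested dictionaries.
tree = {
    "attribute": "income",                      # root node
    "branches": {
        "high": {"label": "buys"},              # leaf node
        "low": {
            "attribute": "student",             # decision node
            "branches": {
                "yes": {"label": "buys"},           # leaf node
                "no": {"label": "does not buy"},    # leaf node
            },
        },
    },
}


def classify(node, record):
    """Follow the branch matching the record's value until a leaf is reached."""
    while "label" not in node:
        value = record[node["attribute"]]
        node = node["branches"][value]
    return node["label"]


print(classify(tree, {"income": "high", "student": "no"}))
print(classify(tree, {"income": "low", "student": "no"}))
```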


    10. Introduction to Clustering

    Clustering is the process of grouping similar data objects into clusters. Objects in the same cluster are similar while objects in different clusters are dissimilar.


    11. Similarity and Distance Measures

    These measures determine how similar or different data objects are.

    • Euclidean Distance
    • Manhattan Distance
    • Cosine Similarity
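
    Unlike the two distances, cosine similarity compares the direction of two vectors and ignores their length. A minimal sketch, assuming non-zero numeric vectors of equal length (the function name is chosen for the example):

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity((1, 2), (2, 4))   # parallel vectors, close to 1
perpendicular = cosine_similarity((1, 0), (0, 1))    # orthogonal vectors, 0
print(same_direction, perpendicular)
```

    This is why cosine similarity is popular for text: two documents with the same word proportions score as similar even if one is much longer.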

    12. Hierarchical Clustering Algorithms

    Hierarchical clustering builds a hierarchy of clusters represented as a tree called a dendrogram.

    • Agglomerative: Bottom-up approach
    • Divisive: Top-down approach
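
    The bottom-up approach can be sketched in a few lines: start with every point in its own cluster and repeatedly merge the two closest clusters. The example assumes one-dimensional points and single-link distance (the gap between the closest members of two clusters); both are simplifying assumptions, and the recorded merges are the information a dendrogram would display.

```python
def single_link_distance(c1, c2):
    """Distance between the closest pair of members, one from each cluster."""
    return min(abs(a - b) for a in c1 for b in c2)


def agglomerative(points, target_clusters):
    """Merge the two closest clusters until target_clusters remain."""
    clusters = [[p] for p in points]  # every point starts as its own cluster
    merges = []                       # (cluster, cluster, distance) per merge
    while len(clusters) > target_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = single_link_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return clusters, merges


points = [1, 2, 6, 7, 15]  # made-up data
clusters, merges = agglomerative(points, 2)
print(clusters)
print([d for _, _, d in merges])
```

    A divisive method would run the other way, starting from one cluster that holds all points and splitting it step by step.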

    13. Partitional Algorithms

    Partitional clustering divides the data into k non-overlapping clusters, where k is chosen in advance.

    Example Algorithm: K-Means
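
    K-Means alternates two steps: assign each point to its nearest centroid, then move each centroid to the mean of its points. The sketch below assumes two-dimensional points and takes the starting centroids as an argument so that the result is repeatable; real implementations usually pick them at random.

```python
def k_means(points, centroids, max_iterations=100):
    """Return final centroids and clusters for 2-D points."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: (p[0] - centroids[i][0]) ** 2 + (p[1] - centroids[i][1]) ** 2,
            )
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for old, members in zip(centroids, clusters):
            if members:
                mean_x = sum(x for x, _ in members) / len(members)
                mean_y = sum(y for _, y in members) / len(members)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # nothing moved, so the algorithm has converged
        centroids = new_centroids
    return centroids, clusters


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up data
centroids, clusters = k_means(points, centroids=[(1, 1), (8, 8)])
print(centroids)
print(clusters)
```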


    14. Hierarchical Clustering Methods

    CURE (Clustering Using Representatives)

    • Uses representative points for clusters
    • Handles large datasets
    • Handles outliers

    CHAMELEON

    • Considers interconnectivity
    • Considers closeness between clusters

    15. Density-Based Methods

    Density-based clustering identifies clusters based on dense regions in the dataset.

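    • DBSCAN: Grows clusters from points that have enough neighbours within a given radius; isolated points are treated as noise
    • OPTICS: Orders points by density reachability, so clusters of varying density can be extracted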
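
    A compact sketch of the DBSCAN idea, assuming two-dimensional points and a brute-force neighbour search. The parameter names eps (neighbourhood radius) and min_pts (minimum neighbours, counting the point itself) follow common usage; the data is made up.

```python
from math import dist

NOISE = -1


def dbscan(points, eps, min_pts):
    """Return one cluster id per point, or NOISE for points in no cluster."""
    labels = [None] * len(points)  # None means "not visited yet"

    def region(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region(i)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may later become a border point of a cluster
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbours = region(j)
            if len(j_neighbours) >= min_pts:
                queue.extend(j_neighbours)  # core point: keep growing the cluster
    return labels


points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 15)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

    The number of clusters is not given in advance; it falls out of the density parameters, and the lone point at (5, 15) is reported as noise.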

    16. Grid-Based Methods

    Grid-based methods divide the data space into a grid structure.

    • STING
    • CLIQUE
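
    The basic grid idea can be sketched by mapping each point to a cell and keeping the cells that hold enough points. This is not an implementation of STING or CLIQUE, only the shared first step; the cell size, threshold, and data are assumptions.

```python
from collections import Counter


def dense_cells(points, cell_size, min_points):
    """Count points per grid cell and return the cells with at least min_points."""
    counts = Counter(
        (int(x // cell_size), int(y // cell_size)) for x, y in points
    )
    return {cell: n for cell, n in counts.items() if n >= min_points}


points = [(0.5, 0.5), (1.2, 1.8), (1.9, 0.1), (5.5, 5.5), (9.0, 9.9)]  # made-up data
dense = dense_cells(points, cell_size=2.0, min_points=2)
print(dense)
```

    After this step the work depends on the number of cells rather than the number of points, which is the main appeal of grid-based methods for large datasets.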

    17. Model-Based Methods

    Model-based clustering assumes that data is generated by a statistical model.

    Example: Gaussian Mixture Models


    18. Association Rules Introduction

    Association rule mining discovers relationships between items in large datasets.

    Example: Customers who buy bread also buy butter.


    19. Large Item Sets

    Large item sets (also called frequent item sets) are groups of items that appear together in transactions at least as often as a chosen minimum support threshold.

    Example: {Milk, Bread}
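
    How often items "appear together" is usually measured by support, and the strength of a rule such as Bread → Butter by confidence. Both are standard measures but are not defined in these notes, so the sketch below states them explicitly; the transactions are made up.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing antecedent, the fraction that also contain consequent."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
]
milk_bread_support = support(transactions, {"Milk", "Bread"})        # 2 of 4 transactions
bread_butter_conf = confidence(transactions, {"Bread"}, {"Butter"})  # 2 of the 3 Bread transactions
print(milk_bread_support, bread_butter_conf)
```

    With a minimum support of 0.5, {Milk, Bread} would count as a large item set in this data.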


    20. Basic Algorithms for Association Rules

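    Apriori Algorithm Steps:

    1. Generate candidate item sets of size k (starting with single items, k = 1)
    2. Count the support of each candidate and remove those below the minimum support
    3. Keep the remaining candidates as the frequent item sets of size k, then repeat with size k + 1 until no new frequent item sets are found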
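
    The steps above can be sketched as a level-by-level loop. The version below uses an absolute support count as the threshold and made-up transactions; it is a minimal illustration rather than an optimized implementation.

```python
from itertools import combinations


def apriori(transactions, min_support_count):
    """Return {frequent item set: support count}, found level by level."""
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]  # size-1 candidates
    frequent = {}
    k = 1
    while candidates:
        # Count each candidate and drop the infrequent ones.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= min_support_count}
        frequent.update(current)
        k += 1
        # Join: combine frequent sets of size k - 1 into candidates of size k.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        # Prune: a candidate survives only if all its (k - 1)-subsets are frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        ]
    return frequent


transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
]
frequent = apriori(transactions, min_support_count=2)
for itemset, count in frequent.items():
    print(sorted(itemset), count)
```

    The prune step relies on the Apriori property: every subset of a frequent item set must itself be frequent, so {Milk, Bread, Butter} is never counted here because {Milk, Butter} is infrequent.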

    21. Parallel and Distributed Algorithms

    Parallel and distributed algorithms process data across multiple processors or machines to improve performance.
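
    One simple scheme is to split the transactions into partitions, count items in each partition independently, and then merge the partial counts. In the sketch below, threads from Python's standard library stand in for separate processors or machines; that substitution, the partitioning rule, and the data are assumptions made to keep the example small.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Local step: count item occurrences in one partition."""
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts


def parallel_item_counts(transactions, workers=2):
    """Split the data, count each part concurrently, then merge the results."""
    partitions = [transactions[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # global step: add up the local counts
    return total


transactions = [
    {"Milk", "Bread"},
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Eggs"},
]
totals = parallel_item_counts(transactions)
print(totals)
```

    Only the small count tables are exchanged between workers, not the transactions themselves, which keeps communication cheap.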


    22. Neural Network Approach

    Neural networks are machine learning models inspired by the human brain. They are used for classification, prediction, and pattern recognition.
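
    The smallest neural network is a single neuron (a perceptron). The sketch below trains one to classify the logical AND of two inputs; the learning rate of 1, the step activation, and the number of epochs are assumptions picked so that the arithmetic stays in whole numbers.

```python
def step(x):
    """Activation: the neuron fires (1) only for positive input."""
    return 1 if x > 0 else 0


def predict(weights, bias, inputs):
    return step(sum(w * x for w, x in zip(weights, inputs)) + bias)


def train_perceptron(samples, epochs=20, learning_rate=1):
    """Adjust weights and bias whenever the prediction is wrong."""
    weights, bias = [0, 0], 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            weights = [w + learning_rate * error * x for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Truth table of logical AND as training data.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train_perceptron(and_samples)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_samples]
print(weights, bias, predictions)
```

    A single perceptron can only separate classes with a straight line; practical neural networks stack many such units in layers to learn more complex patterns.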


    Conclusion

    Classification, clustering, and association rule mining are important techniques in data mining. These methods help discover hidden patterns and support intelligent decision making in many domains.
