Data Warehousing and Data Mining, Dr. P. Rizwan Ahmed

November 22, 2018 | Author: Rizwan Ahmed | Category: Cluster Analysis, Data Warehouse, Data Mining, Statistical Classification, Databases

DATA WAREHOUSING AND DATA MINING

(Choice Based Credit System (CBCS) Pattern) – New Syllabus (For B.Sc. Computer Science, B.Sc. Software Computer Science, B.Sc. ISM, B.Sc. IT, B.Sc. Software System, B.Sc. Software Engineering, BCA, M.Sc. Computer Science, M.Sc. Information Technology, M.Sc. Information System and Management, M.Sc. Software Engineering, MCA, B.E. CSE, B.Tech IT, M.E. CSE, M.Tech IT, M.Phil., and IT Professionals.)

By

Dr.P.Rizwan Ahmed,

MCA, M.Sc., M.A., M.Phil., Ph.D.,

Head of the Department, Department of Computer Applications and PG Department of Information Technology, Mazharul Uloom College, Ambur - 635 802, Vellore Dist., Tamil Nadu.

CONTENTS Preface Acknowledgement PART- I DATA MINING Chapter – 1

Introduction

1.1 An Expanding universe of data 1.2 Information and production factor 1.3 KDD and data mining 1.4 Data Mining vs query tools 1.5 Data Mining in Marketing 1.6 Practical applications of data mining 1.7 Learning 1.8 Self-learning computer systems 1.9 Machine learning 1.9.1 Why machine learning is done? 1.10 Machine learning and the methodology of science 1.10.1 Differences between Data Mining and Machine Learning 1.11 Concept Learning Summary Review Question Chapter – 2

Data Mining and the Data Warehouse

2.1 Data Warehouse: Definitions 2.2 Why do we need Data Warehouse? 2.3 Designing decision support systems 2.3.1 Hardware and software products of a decision support system 2.4 Integration with data mining 2.5 Client/server and data warehousing 2.6 Multi-processing machines 2.7 Cost justification Summary Review Questions

Chapter – 3

Knowledge Discovery Process

3.1 Introduction 3.2 Data selection 3.3 Cleaning 3.4 Coding 3.5 Data mining 3.5.1 Preliminary analysis of the data set using traditional query tools 3.5.1.1 Visualization techniques 3.5.1.2 Likelihood and distance 3.5.1.3 OLAP tools 3.5.1.4 K-nearest neighbor 3.5.1.5 Decision Trees 3.5.1.6 Association Rules 3.5.1.7 Neural networks 3.5.1.8 Genetic algorithms 3.6 Reporting Summary Review Questions Chapter- 4

KDD Environment

4.1 Different forms of knowledge 4.2 KDD environment 4.3 Ten golden rules Summary Review Questions Chapter 5

Real life applications

5.1 Customer profiling 5.2 Predicting bid behavior of pilots 5.3 Discovering foreign key relationships Summary Review Questions Chapter 6

Formal aspects of learning algorithm

6.1 Learning as compression of data sets 6.2 Information content of a message 6.3 Noise and redundancy 6.4 Significance of noise 6.5 Fuzzy databases 6.6 The traditional theory of the relational database 6.7 From relations to tables 6.7.1 From keys to statistical dependencies 6.8 Denormalization 6.9 Data mining primitives Summary Review Questions Chapter – 7

Data Mining

7.1 Introduction 7.2 Data 7.3 Information 7.4 Knowledge 7.5 Historical Note: Many names of Data Mining 7.6 Data Mining 7.6.1 Some of the definitions of Data Mining 7.7 Why Data Mining 7.8 Why Data Mining is Important? 7.9 Uses of Data Mining 7.10 Data Mining Models 7.10.1 Verification Model 7.10.2 Discovery Model 7.11 Development of data mining 7.12 Applications of Data Mining 7.12.1 Healthcare 7.12.2 Finance 7.12.3 Retail Industry 7.12.4 Telecommunication 7.12.5 Text Mining and Web Mining 7.12.6 Higher Education 7.13 Basic Data Mining Tasks / Taxonomy of data mining tasks 7.13.1 Prediction methods 7.13.2 Descriptive methods 7.14 Data Mining Vs Database 7.15 Data Mining Vs KDD 7.16 Steps in Data Mining Process / Steps involved in KDD 7.17 Architecture of a typical data mining system 7.18 Future Trends 7.18.1 Data Trends 7.18.2 Hardware Trends 7.18.3 Network Trends 7.18.4 Scientific Computing Trends 7.18.5 Business Trends 7.19 Major issues in Data Mining / Data Mining Issues 7.20 Data Mining Metrics 7.21 Social Implications of Data Mining 7.22 Data Mining from a database Perspective Summary Review Question Chapter 8

Advanced Databases

8.1 Various kinds of data / Types of Data 8.1.1  Flat files 8.1.2  Relational Databases 8.1.3 Data Warehouses 8.1.4 Transaction Databases 8.1.5 Object oriented databases 8.1.6 Temporal Databases 8.1.7 Text and Multimedia Databases 8.1.8 Spatial Databases 8.1.9 Time-Series Databases 8.1.10  World Wide Web (WWW) 8.1.11 Heterogeneous databases Summary Review Question Chapter 9

Data Mining Functionalities, Classification and Case Study

9.1 Data Mining Functionalities 9.2 Pattern Interesting / Interestingness of Patterns 9.2.1 Interestingness measures: 9.2.2 Objective vs. subjective interestingness measures 9.3 Classification of Data Mining Systems

9.4 Data Mining Task Primitives 9.5 Why Data Mining Primitives and Languages? 9.6 Integration of data mining system with a database or Data warehouse system 9.6.1 No Coupling 9.6.2 Loose Coupling 9.6.3 Semitight coupling 9.6.4 Tight coupling 9.7 Case Study 9.7.1 Customer Attrition: Case Study 9.7.2 Assessing Credit Risk : Case Study 9.7.3 Successful e-commerce - Case Study Summary Review Question Chapter 10

Overview of Data Mining Techniques-I

10.1 Data Mining Techniques 10.1.1 Cluster Analysis 10.1.2 Induction 10.1.3 Decision Trees 10.1.4 Rule induction 10.1.5 Nearest Neighbour 10.1.6 Neural networks 10.2 Data Mining Application Examples Summary Review Question

Chapter 11

Overview of Data Mining Techniques-II

11.1 Introduction 11.2 A Statistical Perspective on Data Mining 11.2.1 Point Estimation 11.2.2 Models Based on Summarization 11.2.3 Bayes Theorem 11.2.4 Hypothesis Testing 11.2.5 Regression and Correlation 11.3 Similarity Measures 11.4 Decision Trees 11.5 Neural Networks

11.6 Genetic Algorithms Summary Review Question Chapter 12

Data Preprocessing

12.1 Introduction 12.2 Why preprocess the data / Need for preprocessing 12.3 Data Preprocessing Techniques / Major Tasks in Data Preprocessing 12.4 Data Cleaning 12.4.1 Missing Data / Values 12.4.1.1 Methods of handling missing data 12.4.2 Noisy Data 12.4.2.1 How to Handle Noisy Data? 12.4.3 Outlier Analysis 12.4.4 Regression 12.5 Data Cleaning as a Process 12.5.1 Discrepancy detection 12.5.2 Discrepancy Detection Tools 12.5.3 Data Transformation 12.5.4 Data Transformation Tools 12.6 Data Integration 12.6.1 Issues to be considered in Data Integration 12.6.1.1 Schema integration 12.6.1.2 Reduction 12.6.1.3 Detecting and resolving data value conflicts 12.6.2 Handling Redundant Data in Data Integration 12.7 Data Transformation 12.7.1 Methods of Data Normalization 12.7.1.1 Min-max normalization 12.7.1.2 z-score normalization 12.7.1.3 Normalization by decimal scaling 12.8 Data Reduction 12.8.1 Data Reduction Strategies 12.8.1.1 Data Cube Aggregation 12.8.1.2 Attribute Subset Selection 12.8.1.3 Dimensionality Reduction

12.8.1.4 Numerosity Reduction 12.8.1.5 Data Discretization and concept hierarchy generation Data discretization 12.9 Data Mining Query Languages (DMQL) Summary Review Questions Chapter 13

Association Rules

13.1 Association Rules 13.2 Large Item sets 13.3 Basic Algorithm 13.3.1 Apriori Algorithm 13.3.2 Partitioning 13.4 Parallel and Distributed Algorithms 13.4.1 Data parallelism 13.4.2 Task parallelism 13.5 Comparing Approaches 13.6 Incremental Rules 13.7 Advanced Association Rule Techniques 13.7.1 Generalized association rules 13.7.2 Multiple-level association rules 13.7.3 Quantitative association rules 13.7.4 Using Multiple Minimum Supports 13.8 Measuring the Quality of Rules Summary Review Questions

Chapter 14

Concept Description: Generalization and Characterization

14.1 Concept Description 14.2 Data Generalization and Summarization-based 14.2.1 Data Generalization 14.2.2 Characterization: Data Cube Approach 14.2.3 Attribute oriented induction for data characterization

14.2.4 Efficient Implementation of Attribute-Oriented Induction 14.3 Analytical characterization: Analysis of attribute relevance 14.4 Mining class comparisons: Discriminating between different classes Mining Class Comparisons 14.5 Descriptive Data Summarization / Mining descriptive statistical measures in large databases 14.5.1 Measuring the Central Tendency 14.5.2 Measuring the Dispersion of Data 14.5.3 Graphics Displays of basic Statistical Description

Summary Review Questions Chapter 15

Mining Frequent Patterns, Associations & Correlations

15.1 Mining Association Rules in Large Databases 15.1.1 Market Basket Analysis: A Motivating Example 15.1.2 Association Rule: Basic Concepts 15.1.3 Association Rule Mining: A Road Map 15.1.4 Mining Frequent Itemsets: the Key Step 15.2 Mining single-dimensional Boolean association rules from transactional databases: Efficient and Scalable Frequent Itemset Mining Methods 15.2.1 Apriori Algorithm 15.2.2 Generating Association Rules from Frequent Itemsets 15.2.3 Methods to Improve Apriori’s Efficiency 15.2.4 Mining Frequent Patterns without Candidate Generation 15.2.5 Principles of Frequent Pattern Growth 15.3 Mining various kinds of Association Rules 15.3.1 Mining multilevel association rules from transactional databases: Multiple-Level Association Rules 15.3.2 Mining multidimensional association rules from transactional databases and data warehouse 15.4 From Association Mining to Correlation Analysis 15.5 Constraint-Based Association Mining

Summary Review Questions Chapter 16

Classification

16.1 Introduction 16.1.1 Classification algorithms based on the categorization: Issues in Classification 16.2 Statistical-Based Algorithms 16.2.1 Regression 16.2.2 Bayesian classification 16.2.3 Naïve Bayes Classifier 16.3 Distance-Based Algorithms 16.3.1 Simple Approach 16.3.2 K Nearest Neighbors 16.4 Decision Tree-Based Algorithms 16.4.1 C4.5 16.4.2 CART 16.4.2.1 Scalable DT techniques 16.5 Neural Network-Based Algorithms 16.5.1 Propagation 16.5.2 NN supervised learning 16.5.3 Radial Basis Function Networks 16.5.4 Perceptron 16.6 Rule-Based Algorithms 16.6.1 Generating Rules from a DT 16.6.2 Generating Rules from a Neural Net 16.6.3 Generating Rules without a DT or NN 16.7 Combining Techniques Summary Review Questions Chapter 17

Classification and Prediction

17.1 Classification 17.1.1 Classification — A Two-Step Process 17.1.2 Prediction 17.1.3 Issues regarding classification and prediction 17.1.4 Comparing Classification and Prediction Methods 17.2 Classification by decision tree induction 17.2.1 Decision Tree Induction 17.2.2 Attribute Selection Measure 17.2.3 Information Gain (ID3/C4.5) 17.2.4 Gini Index (IBM IntelligentMiner) 17.2.5 Extracting Classification Rules from Trees 17.2.6 Avoid Overfitting in Classification

17.2.7 Enhancements to basic decision tree induction 17.2.8 Classification in Large Databases 17.3 Bayesian Classification: Introduction 17.3.1 Bayesian Classification: Why? 17.3.2 Bayesian Classification 17.3.3 Bayesian Theorem 17.3.4 Naïve Bayes Classifier 17.3.5 Bayesian Belief Networks 17.3.6 Training Bayesian Belief Networks 17.4 Rule Based Classification 17.4.1 Using IF-THEN Rules for Classification 17.4.2 Rule Extraction from a Decision Tree 17.4.3 Rule induction using a Sequential Covering Algorithm 17.4.4 Rule Quality Measures 17.5 Classification by backpropagation 17.6 Classification based on concepts from association rule mining / Association-Based Classification / Classification by association Rules 17.7 Lazy Learners (or Learning from Your Neighbors) 17.7.1 k-Nearest Neighbor 17.7.2 Case-Based Reasoning (CBR) 17.8 Other Classification Methods 17.8.1 Genetic Algorithms 17.8.2 Rough Set Approach 17.8.3 Fuzzy Sets Approaches 17.9 Prediction 17.10 Classification accuracy 17.10.1 Classification Accuracy: Estimating Error Rates Summary Review Questions Chapter- 18

Clustering

18.1 Introduction 18.2 Similarity and Distance Measures 18.3 Outliers 18.4 Hierarchical Algorithms 18.4.1 Agglomerative Algorithms 18.5 Partitional Algorithms 18.5.1 Minimum spanning tree 18.5.2 Squared Error Clustering Algorithm 18.5.3 K-means clustering 18.5.4 Nearest neighbor algorithm 18.5.5 PAM Algorithm 18.5.5.1 CLARA 18.5.5.2 CLARANS 18.5.6 Clustering with genetic algorithms 18.5.7 Clustering With Neural Networks 18.5.7.1 Self-Organizing Feature Maps 18.6 Clustering Large Databases 18.6.1 BIRCH 18.6.2 DBSCAN 18.6.3 CURE Algorithm 18.7 Comparison of Clustering Algorithm Summary Review Questions Chapter 19

Cluster Analysis

19.1 What is Cluster Analysis? 19.2 General Applications of Clustering 19.3 Examples of Clustering Applications 19.4 What is Good Clustering? 19.5 Requirements of Clustering in Data Mining 19.6 Types of Data in Cluster Analysis 19.6.1 Interval-valued variables 19.6.2 Binary Variables 19.6.3 Nominal, Ordinal, and Ratio-Scaled Variables 19.7 A Categorization of Major Clustering Methods 19.7.1 Major Clustering Approaches 19.8 Partitioning Methods: Basic Concept 19.8.1 K-Means Clustering Method 19.8.2 K-Medoids Clustering Method 19.8.2.1 Comparison between K-means and K-medoids 19.8.3 PAM 19.8.4 CLARA 19.9 Hierarchical Methods 19.9.1 Types of Hierarchical Clustering Methods 19.9.1.1 Agglomerative Hierarchical Clustering 19.9.1.2 Divisive Hierarchical Clustering 19.9.2 BIRCH

19.9.3 CURE 19.9.4 ROCK 19.9.5 CHAMELEON 19.10 Density-Based Methods 19.10.1 DBSCAN 19.10.2 OPTICS 19.10.3 DENCLUE 19.11 Grid-Based Methods 19.11.1 STING 19.11.2 WaveCluster 19.11.3 CLIQUE 19.12 Model-Based Clustering Methods 19.12.1 Expectation – Maximization (EM) 19.12.2 Conceptual clustering 19.12.3 Neural network approaches 19.13 Outlier Analysis 19.13.1 Outlier Discovery: Statistical Approaches 19.13.2 Outlier Discovery: Distance-Based Approach 19.13.3 Outlier Discovery: Deviation-Based Approach Summary Review Questions Chapter 20

Advanced Topics (Mining Complex types of data)

20.1 Multidimensional analysis and descriptive mining of complex data objects 20.1.1 Generalization of Structured Data 20.1.2 Generalizing Spatial and Multimedia Data 20.1.3 Generalizing Object Data 20.1.4 Generalization-based Mining of Plan Databases by Divide and Conquer 20.2 Spatial Data Mining 20.2.1 Dimensions and Measures in Spatial Data Warehouse 20.2.2 Mining Spatial Association and Co-location Patterns 20.2.3 Spatial Classification and Spatial Trend Analysis 20.3 Mining multimedia databases 20.3.1 Similarity Search in Multimedia Data 20.3.2 Multidimensional Analysis of Multimedia Data 20.4 Mining time-series and sequence data 20.4.1 Time-series database 20.4.2 Mining Time-Series and Sequence Data: Trend analysis

20.4.3 Estimation of Trend Curve 20.4.4 Discovery of Trend in Time-Series 20.4.5 Multidimensional Indexing 20.4.6 Subsequence Matching 20.4.7 Query Languages for Time Sequences 20.5 Text Mining / Mining text databases 20.5.1 Text Data Analysis and Information Retrieval 20.5.2 Text Indexing Techniques 20.5.3 Text Mining Approaches 20.6 Mining the World-Wide Web / Web Mining Chapter 21

Applications and Trends in Data Mining

21.1 Applications of Data Mining 21.1.1 Data Mining for Financial Data Analysis 21.1.2 Data Mining for Retail Industry 21.1.3 Data Mining for Telecommunication Industry 21.1.4 Biomedical Data Mining and DNA Analysis 21.1.5 Data Mining Applications in Sales/Marketing 21.1.6 Data Mining Applications in Banking / Finance 21.1.7 Data Mining Applications in Health Care and Insurance 21.2 Data mining system products and research prototypes 21.2.1 How to choose a data mining system? 21.2.2 Examples of Data Mining Systems 21.3 Additional themes on data mining 21.3.1 Theoretical Foundations of Data Mining 21.3.2 Statistical Data Mining 21.4 Social impact of data mining 21.5 Trends in data mining Summary Review Questions PART – II DATA WAREHOUSING

Chapter 22

Data warehousing

22.1 Introduction 22.2 Characteristics of Data Warehouse 22.3 Need for Data Warehousing 22.4 Why Separate Data Warehouse? 22.5 Difference between Operational databases and Data Warehouses 22.6 Difference between OLTP and Data warehouse 22.7 Benefits of Data Warehousing 22.8 Future of data warehouse 22.9 Limitations of Data Warehouse 22.10 Applications of Data Warehousing 22.11 Advantages of Data Warehousing 22.12 Data Warehousing Tools Summary Review Questions Chapter 23

Data Warehousing Components

23.1 Overall Architecture 23.2 Data warehouse database 23.3 Sourcing, acquisition, cleanup, and transformation tools 23.4 Metadata 23.5 Access tools 23.5.1 Query and reporting tools 23.5.2 Application 23.5.3 OLAP 23.5.4 Data mining 23.6 Data marts 23.7 Data warehouse administration and management Summary Review Questions Chapter 24

From Data warehousing to data mining

24.1 Data warehouse usage 24.1.1 Three kinds of data warehouse applications 24.2 Information processing Online Analytical Processing 24.2.1 Advantages of OLAM 24.2.2 Architecture of On-Line Analytical Mining 24.2.3 Comparison between OLAP and OLAM Summary Review Questions

Chapter 25

Data Warehouse Architecture

25.1 Data Warehouse architecture 25.1.1 Steps for the design and construction of data warehouse 25.1.2 Data Warehouse Design Process 25.1.3 Three-Tier Data Warehouse Architecture 25.1.3.1 Enterprise Warehouse 25.1.3.2 Data Mart 25.1.3.3 Virtual data warehouse 25.2 Data warehouse Back-End Tools and Utilities 25.3 Metadata Repository 25.4 OLAP Engine 25.4.1 Relational OLAP (ROLAP) 25.4.2 Multidimensional OLAP (MOLAP) 25.4.3 Hybrid OLAP (HOLAP) 25.4.4 Specialized Servers Summary Review Questions Chapter 26

Data Warehouse Implementation

26.1 Data Warehouse Implementation 26.1.1 Efficient Computation of Data Cubes 26.1.2 Cube Operation 26.1.3 Indexing OLAP Data: Bitmap Index 26.1.4 Indexing OLAP Data: Join Indices 26.1.5 Efficient Processing OLAP Queries Summary Review Questions Chapter 27

Mapping the data warehouse to a multiprocessor architecture

27.1 Relational database technology for data warehouse 27.1.1 Types of parallelism 27.1.2 Data partitioning 27.2 Data base architecture for parallel processing 27.2.1 Shared-memory architecture 27.2.2 Shared-disk architecture 27.2.3 Shared-nothing architecture

27.2.4 Combined architecture 27.3 Parallel RDBMS features 27.4 Alternative technologies 27.5 Parallel DBMS Vendors 27.5.1 Oracle 27.5.2 Informix 27.5.4 Sybase 27.5.5 Microsoft Summary Review Questions Chapter 28

Reporting and Query Tools and Applications

28.1 Tool categories 28.1.1 Reporting tools 28.1.2 Managed Query Tools 28.1.3 Executive information tools 28.1.4 OLAP tools 28.1.5 Data mining tools 28.2 Need for application 28.3 Cognos Impromptu 28.4 Applications 28.4.1 PowerBuilder Summary Review Questions Chapter 29

On-Line Analytical Processing (OLAP)

29.1 Introduction 29.2 Need for OLAP 29.3 Multidimensional data model 29.3.1 From Tables and Spreadsheets to Data Cubes 29.4 OLAP Guidelines / OLAP Product Evaluation Rules 29.5 Data Warehouse Schema / OLAP Schema 29.5.1 Star Schema 29.5.2 Star Schema Keys 29.5.3 Advantages of Star schema 29.5.4 Snow Flake Schema

29.5.5 Fact Constellation 29.6 Concept hierarchies 29.7 OLAP operation in the Multidimensional Data Model 29.8 Multidimensional versus Multirelational OLAP 29.9 Categorization of OLAP Tools 29.10 OLAP Tools and the Internet 29.11 Difference between OLTP and OLAP 29.12 Comparison of DBMS, OLAP, and Data Mining Summary Review Questions Chapter 30

Security

30.1 Introduction 30.2 Requirements 30.2.1 User Access 30.2.2 Legal Requirements 30.2.3 Audit Requirements 30.2.4 Network Requirements 30.2.5 Data Movement 30.2.6 Documentation 30.2.7 High-Security Environments 30.3 Performance Impact of Security 30.3.1 Views 30.3.2 Data Movement 30.4 Security Impact on Design 30.4.1 Application Development 30.4.2 Database Design 30.4.3 Testing Summary Review Questions Chapter 31

Service Level Agreement (SLA)

31.1 Introduction 31.2 Definition of Types of System 31.3 Defining the SLA 31.3.1 User Requirements 31.3.2 System Requirements Summary Review Questions

Chapter 32

Operating the data warehouse

32.1 Introduction 32.2 Day-to-Day Operations of the Data Warehouse 32.3 Overnight Processing Summary Review Questions Chapter 33

Capacity Planning

33.1 Process 33.2 Estimating the Load 33.2.1 Initial Configuration 33.2.2 How much CPU bandwidth 33.2.3 How Much Memory 33.2.4 How much disk? Summary Review Questions Chapter 34

Tuning and testing the data warehouse

34.1 Tuning the Data Load 34.2 Prioritized Tuning Steps 34.3 Tuning Queries 34.3.1 Fixed queries 34.3.2 Ad hoc queries 34.4 Testing the Data Warehouse 34.4.1 Introduction 34.4.2 The Testing Terminologies 34.4.3 Testing the operational environment 34.4.5 Testing the database 34.4.5.1 Testing database manager and monitoring tools 34.4.5.2 Testing database features 34.4.5.3 Testing database performance 34.5 Testing the Application Summary

Review Questions

Chapter 35

Backup and Recovery

35.1 Introduction 35.1.1 Types of Backup 35.2 Data Warehouse Recovery Models 35.3 Define Backup and Recovery Strategy 35.4 Security Impact on Design of Data Warehouse 35.4.1 Application Development 35.4.2 Database Design 35.4.3 Testing 35.5 Disaster Recovery Summary Review Questions APPENDIX A: Glossary APPENDIX B: Two marks Questions with Answers APPENDIX C: Past University Question Papers BIBLIOGRAPHY
