Data Mining Notes 1

Types of attributes

  • Nominal (ID numbers)
  • Ordinal (grades)
  • Interval (dates)
  • Ratio (length, counts)

How to measure the similarity of two objects

Similarity = 1 - Dissimilarity (when both are scaled to the range [0,1])

  • Similarity
    • Numerical measure of how alike two data objects are
    • Higher when objects are more alike
    • Often falls in the range [0,1]
  • Dissimilarity
    • Numerical measure of how different two data objects are
    • Lower when objects are more alike
    • Minimum dissimilarity is often 0
    • Upper limit varies

Proximity refers to either similarity or dissimilarity
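
A minimal sketch of these measures, assuming Euclidean distance as the dissimilarity. The two conversions to similarity (`similarity_from_distance` and `similarity_from_scaled`) are illustrative choices, not the only options, and `d_max` is an assumed upper limit that the caller must supply:

```python
import math

def euclidean_distance(x, y):
    """Dissimilarity: 0 when the objects are identical, no fixed upper limit."""
    if len(x) != len(y):
        raise ValueError("objects must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity_from_distance(d):
    """Map a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

def similarity_from_scaled(d, d_max):
    """Similarity = 1 - dissimilarity, after scaling the dissimilarity to [0, 1]."""
    if d_max <= 0:
        raise ValueError("d_max must be positive")
    return 1.0 - min(d, d_max) / d_max

p = (0.0, 0.0)
q = (3.0, 4.0)
d = euclidean_distance(p, q)           # 5.0
s1 = similarity_from_distance(d)       # 1 / (1 + 5)
s2 = similarity_from_scaled(d, 10.0)   # 1 - 5/10 = 0.5
```

The second conversion only makes sense when a meaningful upper limit for the dissimilarity is known, which is why the "upper limit varies" point matters.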

Data Quality

Examples of problems:

  • Noise and outliers
  • Missing Values
  • Duplicate data

Duplicate Data

A data set may include data objects that are duplicates, or near duplicates, of one another

  • Major issue when merging data from heterogeneous sources

Example: Same person with multiple email addresses
Data Cleaning: Process of dealing with duplicate data issues
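
A rough sketch of the email example, assuming records are dictionaries with hypothetical `name` and `email` fields and that two records refer to the same person when their normalized names match. Real data usually needs a more careful matching rule than this:

```python
from collections import defaultdict

def normalize_name(name):
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())

def merge_duplicates(records):
    """Group records by normalized name and collect each person's distinct emails."""
    merged = defaultdict(set)
    for rec in records:
        merged[normalize_name(rec["name"])].add(rec["email"].strip().lower())
    return {name: sorted(emails) for name, emails in merged.items()}

# Made-up records: the first two describe the same person.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace", "email": "A.Lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]
people = merge_duplicates(records)
# Three input records collapse into two people.
```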

Data Preprocessing

  • Aggregation: Combining two or more attributes (or objects) into a single attribute (or object)
    • Data reduction
    • Change of scale
    • More “stable” data
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Discretization
  • Attribute Transformation
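
To illustrate aggregation, here is a small sketch that rolls daily records up into monthly totals. It shows both data reduction (fewer objects) and a change of scale (days to months). The `aggregate_monthly` helper and the sales figures are made up for illustration:

```python
from collections import defaultdict

def aggregate_monthly(daily_sales):
    """Sum (ISO date string, amount) pairs into per-month totals."""
    totals = defaultdict(float)
    for date, amount in daily_sales:
        month = date[:7]  # "YYYY-MM-DD" -> "YYYY-MM"
        totals[month] += amount
    return dict(totals)

daily_sales = [
    ("2021-01-30", 120.0),
    ("2021-01-31", 80.0),
    ("2021-02-01", 95.0),
    ("2021-02-02", 105.0),
]
monthly = aggregate_monthly(daily_sales)
# Four daily objects become two monthly objects.
```

Monthly totals also tend to vary less than daily values, which is the sense in which aggregated data is more "stable".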

Sampling

  • Sampling is the main technique employed for data selection
  • Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
  • Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

The key principle for effective sampling:

  1. Using a sample will work almost as well as using the entire data set, if the sample is representative.
  2. A sample is representative if it has approximately the same property (of interest) as the original set of data
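
One way to check representativeness, sketched under the assumption that the property of interest is the mean. The population here is synthetic (normally distributed values generated with a fixed seed), so the numbers are only illustrative:

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic population: 10,000 values centered near 50.
population = [rng.gauss(50.0, 10.0) for _ in range(10_000)]

# Simple random sample of 500 objects, drawn without replacement.
sample = rng.sample(population, 500)

pop_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)
gap = abs(pop_mean - sample_mean)
# A small gap suggests the sample preserves the mean of the full data set.
```

The same comparison can be repeated for any other property that matters to the analysis, such as the variance or the class proportions.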

Types of sampling

  • Simple Random Sampling
  • Sampling without replacement
  • Sampling with replacement
  • Stratified sampling
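
The sampling types above can be sketched with Python's standard `random` module. The helper names and the toy data are assumptions for illustration. The stratified version draws an equal number of objects from each group, which is one common variant (drawing in proportion to group size is another):

```python
import random
from collections import defaultdict

def sample_without_replacement(items, k, rng):
    """Each object can be selected at most once."""
    return rng.sample(items, k)

def sample_with_replacement(items, k, rng):
    """The same object can be selected more than once."""
    return rng.choices(items, k=k)

def stratified_sample(items, key, per_stratum, rng):
    """Draw up to per_stratum objects from each group defined by key(item)."""
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)
    picked = []
    for group in strata.values():
        picked.extend(rng.sample(group, min(per_stratum, len(group))))
    return picked

rng = random.Random(42)
# Toy data set: 95 "common" objects and 5 "rare" ones.
data = [("common", i) for i in range(95)] + [("rare", i) for i in range(5)]

without = sample_without_replacement(data, 10, rng)
with_repl = sample_with_replacement(data, 10, rng)
stratified = stratified_sample(data, key=lambda item: item[0], per_stratum=3, rng=rng)
```

A simple random sample of 10 objects can easily miss the rare group entirely, while the stratified sample always includes it.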

Dimensionality Reduction

???? Need to learn later

Author

Elliot

Posted on

2021-01-31

Updated on

2023-05-07
