Data Mining Notes 1
Types of data attributes
- Nominal (ID numbers)
- Ordinal (grades)
- Interval (dates)
- Ratio (length, counts)
How to measure the similarity of two objects
Similarity = 1 - Dissimilarity (holds when both measures fall in the range [0,1])
- Similarity
- Numerical measure of how alike two data objects are
- Is higher when objects are more alike
- Often falls in the range [0,1]
- Dissimilarity
- Numerical measure of how different two data objects are
- Lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies
Proximity refers to a similarity or dissimilarity
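A minimal sketch of the two kinds of proximity measure, as an illustration only. The function names are hypothetical and not from the notes. The mismatch measure stays in [0,1], so similarity = 1 - dissimilarity applies directly. Euclidean distance has no fixed upper limit, so the sketch assumes a different conversion, 1 / (1 + d), which is one common choice among several.

```python
import math

def mismatch_dissimilarity(x, y):
    """Fraction of attributes on which two nominal-valued objects differ (in [0, 1])."""
    return sum(a != b for a, b in zip(x, y)) / len(x)

def euclidean_distance(x, y):
    """Dissimilarity for numeric objects: 0 when identical, no fixed upper limit."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def similarity_from_distance(d):
    """Assumed conversion from a distance in [0, inf) to a similarity in (0, 1]."""
    return 1.0 / (1.0 + d)

# Nominal attributes: the objects differ on one of three attributes.
d_nominal = mismatch_dissimilarity(("red", "small", "round"), ("red", "large", "round"))
s_nominal = 1 - d_nominal  # similarity = 1 - dissimilarity, valid because d is in [0, 1]

# Numeric attributes: distance between (0, 0) and (3, 4) is 5.
d_numeric = euclidean_distance((0.0, 0.0), (3.0, 4.0))
s_numeric = similarity_from_distance(d_numeric)
```

Identical objects give a dissimilarity of 0 and a similarity of 1 under both measures.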
Data Quality
Examples of problems:
- Noise and outliers
- Missing Values
- Duplicate data
Duplicate Data
A data set may include data objects that are duplicates, or almost duplicates, of one another
- Major issue when merging data from heterogeneous sources
Example: Same person with multiple email addresses
Data Cleaning: Process of dealing with duplicate data issues
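A rough sketch of the email example, under a strong assumption: two records refer to the same person when their names match after lowercasing and collapsing whitespace. Real data cleaning usually needs more careful matching. The records and function names below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical records, e.g. merged from two different sources.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace ", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]

def normalize_name(name):
    """Lowercase the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def merge_duplicates(rows):
    """Group records by normalized name and collect each person's email addresses."""
    merged = defaultdict(set)
    for row in rows:
        merged[normalize_name(row["name"])].add(row["email"])
    return {name: sorted(emails) for name, emails in merged.items()}

people = merge_duplicates(records)
# The two "Ada Lovelace" records collapse into one entry with two email addresses.
```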
Data Preprocessing
- Aggregation: Combining two or more attributes
- Data reduction
- Change of scale
- More “stable” data
- Sampling
- Dimensionality Reduction
- Feature subset selection
- Discretization
- Attribute Transformation
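Two of the steps above, aggregation and discretization, can be sketched as follows. This is an illustration under assumptions, not a method from the notes: aggregation is shown as combining daily records into monthly totals (a change of scale), and discretization is shown as equal-width binning. The data and function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
daily_sales = [
    ("2024-01-03", 120.0),
    ("2024-01-17", 80.0),
    ("2024-02-05", 200.0),
]

def aggregate_by_month(rows):
    """Aggregation: sum daily amounts into one total per month (YYYY-MM)."""
    totals = defaultdict(float)
    for date, amount in rows:
        totals[date[:7]] += amount
    return dict(totals)

def equal_width_bins(values, k):
    """Discretization: map each numeric value to one of k equal-width bins (0..k-1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    if width == 0:
        return [0] * len(values)  # all values identical: a single bin
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / width), k - 1) for v in values]

monthly = aggregate_by_month(daily_sales)
age_bins = equal_width_bins([18, 25, 40, 62, 78], 3)
```

Aggregation leaves fewer, coarser records (data reduction), and the monthly totals tend to vary less than the daily ones, which is the sense of more "stable" data.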
Sampling
- Sampling is the main technique employed for data selection
- Statisticians sample because obtaining the entire set of data of interest is too expensive or time-consuming
- Sampling is used in data mining because processing the entire set of data of interest is too expensive or time-consuming
Key principle for effective sampling:
- Using a sample will work almost as well as using the entire data set, if the sample is representative
- A sample is representative if it has approximately the same properties (of interest) as the original set of data
Types of sampling
- Simple Random Sampling
- Sampling without replacement
- Sampling with replacement
- Stratified sampling
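The sampling types above can be sketched with Python's standard `random` module. The function names, the `fraction` parameter, and the toy data are assumptions made for illustration. The stratified version here draws the same fraction from every group, which is one common variant. Drawing an equal number from each group is another.

```python
import random
from collections import defaultdict

def sample_without_replacement(data, n, rng):
    """Simple random sampling: each object can be picked at most once."""
    return rng.sample(data, n)

def sample_with_replacement(data, n, rng):
    """Simple random sampling: the same object can be picked more than once."""
    return rng.choices(data, k=n)

def stratified_sample(data, key, fraction, rng):
    """Split the data into groups (strata) by key, then sample each group separately."""
    strata = defaultdict(list)
    for item in data:
        strata[key(item)].append(item)
    sample = []
    for group in strata.values():
        n = max(1, round(len(group) * fraction))  # take at least one object per stratum
        sample.extend(rng.sample(group, n))
    return sample

rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy data: 90 objects of class "a" and 10 of class "b".
data = [("a", i) for i in range(90)] + [("b", i) for i in range(10)]

srs = sample_without_replacement(data, 10, rng)
swr = sample_with_replacement(data, 10, rng)
strat = stratified_sample(data, key=lambda item: item[0], fraction=0.1, rng=rng)
```

With a 10% fraction, the stratified sample contains 9 objects of class "a" and 1 of class "b", so the rare class is guaranteed to appear. A simple random sample of 10 may miss it entirely.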
Dimensionality Reduction
???? Need to learn later