KLam's Tech life

Posted 2021-01-31 | Updated 2025-06-19 | Notes | 2 minutes read (About 259 words)

Attribute of data

Similarity = 1 - Dissimilarity (when both are scaled to the range [0,1])

Similarity
- Numerical measure of how alike two data objects are
- Is higher when objects are more alike
- Often falls in the range [0,1]
Dissimilarity
- Numerical measure of how different two data objects are
- Is lower when objects are more alike
- Minimum dissimilarity is often 0
- Upper limit varies

Proximity refers to either a similarity or a dissimilarity
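
A minimal sketch of the relationship above, assuming every attribute has already been scaled to [0,1] so that the dissimilarity also stays in [0,1]. The mean absolute difference used here is only one possible dissimilarity measure, and the function names and sample vectors are made up for illustration.

```python
def dissimilarity(x, y):
    """Mean absolute difference of two equal-length vectors scaled to [0, 1]."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)


def similarity(x, y):
    """Similarity = 1 - Dissimilarity, valid because dissimilarity is in [0, 1]."""
    return 1.0 - dissimilarity(x, y)


# Hypothetical data objects with three attributes each.
p = [0.0, 0.5, 1.0]
q = [0.0, 0.5, 1.0]  # identical to p
r = [1.0, 0.5, 0.0]  # differs from p in two attributes

print(similarity(p, q))  # identical objects: highest similarity
print(similarity(p, r))  # less alike: lower similarity
```

Identical objects get dissimilarity 0 and similarity 1; the more the attributes differ, the lower the similarity.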

Examples of problems:

A data set may include data objects that are duplicates, or near duplicates, of one another

Example: Same person with multiple email addresses
Data Cleaning: Process of dealing with duplicate data issues
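
The "same person, multiple email addresses" case can be sketched as below. This assumes, naively, that two records belong to the same person whenever their names match after lowercasing and collapsing whitespace; real data cleaning usually needs stronger matching rules. The records and helper names are hypothetical.

```python
from collections import defaultdict

# Hypothetical records: the first two describe the same person.
records = [
    {"name": "Ada Lovelace", "email": "ada@example.com"},
    {"name": "ada  lovelace", "email": "a.lovelace@example.org"},
    {"name": "Alan Turing", "email": "alan@example.com"},
]


def normalize_name(name):
    """Lowercase the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def group_duplicates(rows):
    """Group email addresses under the normalized name they belong to."""
    groups = defaultdict(list)
    for row in rows:
        groups[normalize_name(row["name"])].append(row["email"])
    return dict(groups)


merged = group_duplicates(records)
for name, emails in merged.items():
    print(name, emails)
```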

Aggregation: Combining two or more attributes (or objects) into a single attribute (or object)
- Data reduction
- Change of scale
- More “stable” data
Sampling
Dimensionality Reduction
Feature subset selection
Discretization
Attribute Transformation
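
As an illustration of aggregation, the first technique in the list above, here is a small sketch that rolls daily sales up into monthly totals. The data and the `aggregate_by_month` helper are invented for this example. It shows both data reduction (fewer rows) and change of scale (days to months).

```python
from collections import defaultdict

# Hypothetical daily sales as (ISO date, amount) pairs.
daily_sales = [
    ("2021-01-30", 120.0),
    ("2021-01-31", 80.0),
    ("2021-02-01", 50.0),
    ("2021-02-02", 70.0),
]


def aggregate_by_month(rows):
    """Sum the amounts of all rows that share the same YYYY-MM prefix."""
    totals = defaultdict(float)
    for date, amount in rows:
        month = date[:7]  # "2021-01-30" -> "2021-01"
        totals[month] += amount
    return dict(totals)


monthly_sales = aggregate_by_month(daily_sales)
print(monthly_sales)
```

Monthly totals also tend to vary less than individual daily values, which is the sense in which aggregated data is more "stable".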

Sampling is the main technique employed for data selection
Statisticians sample because obtaining the entire set of data of interest is too expensive or time consuming
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

Using a sample will work almost as well as using the entire data set, if the sample is representative.
A sample is representative if it has approximately the same property (of interest) as the original set of data
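
A rough sketch of what "representative" can mean in practice, assuming the property of interest is the mean. It draws a simple random sample without replacement and compares the sample mean with the mean of the full data set. The population, sample size, and seed are arbitrary choices for illustration.

```python
import random
from statistics import mean

# Hypothetical "entire data set": the integers 1..10000.
population = list(range(1, 10001))

random.seed(42)  # fixed seed so the run is repeatable
sample = random.sample(population, 1000)  # simple random sample, no replacement

population_mean = mean(population)
sample_mean = mean(sample)

print("population mean:", population_mean)
print("sample mean:", sample_mean)
print("difference:", abs(sample_mean - population_mean))
```

If the difference is small relative to the spread of the data, the sample preserves this property reasonably well, while being a tenth of the size to process.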

???? Need to learn later