Posted 2021-01-31Updated 2025-06-19Notes2 minutes read (About 259 words)

Attribute of data

Nominal (ID numbers)
Ordinal (grades)
Interval (dates)
Ratio

How to measure the similarity of two objects

Similarity = 1 - Dissimilarity

Similarity
- Numerical measure of how alike two data objects are
- Is higher when objectsare more alike
- Often falls in the range [0,1]
Dissimilarity
- Numberical measure of how different re two data objects
- Lower when objects are more alike
- Miinimum Dissimilarity is often 0
- Upper limit varies

Proximity refers to a similarity or dissimilarity

Data Quality

Examples of problems:

Noise and outliers
Missing Values
Duplicate data

Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates of one another

Major issue when merging data from heterogeous sources

Example: Same person with multiple email addresses
Data Cleaning: Process of dealing with duplicate data issues

Data Preprocessing

Aggregation: Conbining two or more attributes
- Data reduction
- Change of scale
- More “stable” data
Sampling
Dimensionality Reduction
Feature subset selection
Discretization
Attribute Transformation

Sampling

Sampling is the main technique employed for data selection
Statisticians sample because obtaining the entire set of data interest is too expensive or time consuming
Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

Key principle for effective sampling is

using a sample will work almost as well as using the entire data sets, if the sample is representative.
A sample is representative if it has approximately the same property as the original set of data

Types of sampling

Simple Random Sampling
Sampling without replacement
Sampling with replacement
Stratified sampling

Dimensionality Reduction

???? Need to learn later

Posted 2021-01-29Updated 2025-06-19Notes3 minutes read (About 457 words)

Distributed and Cloud Computing Notes 1

Reasons for Distributed Systems

Functional Separation
- Different Capabilities and purposes
Inherent Distribution
- Information
- People
Power imbalance and load variation
Reliability
Economies

Consequences of Distributed Systems

Concurrency - Each computer is autonomous
- Carry our tasks independently
- Tasks coordinate their actions by exchanging messages
- System capacity can be increased by adding more resources
No global clock
Independent Failures

Motivation of Distributed Systems

To share resource and information

Trends in Distributed Systems

The emergence of pervasive networking technology
The emergence of mobile and ubiquitous computing
The increasing demand for multimedia services
The view of distributed systems as a utility

Maintenance of intranet

No rick if no connection to internet
Firewalls are used to limit services from/to an intranet
- Limit FTP/Remote Desktop etc.

Mobile computing: Performing computing tasks while the user is on the move, away from his/her usual environment

Eight forms of transparency

Access transparency
Location transparency
Concurrency transparency
Replication transparency
Failure transparency
Mobility transparency
Performance transparency
Scaling transparency

List of Challenge

Heterogeneity
Security
- Confidentiality
  - Protection against disclosure to unauthorized individual information
- Integrity
  - Protection against alteration or corruption
- Availability
  - Protection against interference targeting access to the resources
  - DDoS
- Authenticity or Non-repudiation
  - Proof of sending / receiving an information
  - digital signature

Failure

Availability =MTTF/(MTTF+MTTR)

Mean time to failure(MTTF)
- The average time of normal operation before the system fails
Mean time to repair (MTTR)
- The average time it takes to repair the system and restore it to working condition

Single point failure

Single hardware/Software component failures cause the whole system crash. The key approach to enhancing availability is to make as many as possible partial failures by removing single points of failure

Checkpointing

The process of periodically saving the stage of an executing program to stable storage, from which the system can recover after a failure.
- Each program stae saved is called a Checkpoint .
- Checkpointing can be realized by operating system at kernel level/Third party library/by the application itself.

Jobs

Serial Jobs: Run on a single node
Parallel jobs: use multiple nodes
Interactive jobs: require fast turnaround time, and their input/output is directed to a termainal
Batch jobs: need more resources and don’t need immediate responses. Scheduled jobs.

job Management System

A user server: Let user submit jobs.
A job scheduler: performs job scheduling
A resource manager: allocates and monitors resources. Enforces scheduling policies, and collects accounting information.

Security Mechanisms

Encryption(AES, RSA)
Authentication(Password, Public key)
Authorization(access control)

Concurrency
- Fair scheduling
- Preserve dependencies
- Avoid deadlocks
- Object locking, data consistency, semaphores
Fault tolerance (No failure despite faults)
- Fault detection
  - Checksums
  - Heartbeat
- Fault masking
  - Retransmission of corrupted messages
  - Redundancy
- Fault toleration
  - Exception handling
  - Timeouts
- Fault recovery
  - Rollback mechanisms
Scalability
Openness
Distribution transparency <= Do not let other touch

Posted 2021-01-24Updated 2025-06-19Notes2 minutes read (About 247 words)

Information Security Notes 1 - Classical Crypto System

What is Encryption

Encryption is composed of a key and an Encryption algorithm.

Type of operations used for transforming plaintext to ciphertext
The number of keys used
The way in which the plaintext is processed

Encryption and Decryption

Unencrypted message = plaintext/message
Encrypted message = cipher/ciphertext

Cryptanalysis

Means attack

Brute-force
- Tries every possible
Breaking the algorithm
- Tries to exploit the weakness of the encryption algorithm

How to measured by the following dimensions

Attacker models
- How strong is the attacker
1. Ciphertext only attacks
2. Known plaintext attacks
3. Chosen plaintext attacks(Attacker can choose plaintext on his own)
4. Chosen ciphertext attacks(Attacker can choose the cipher and obtain the plaintext)
Security Goal
- What Goals does your attacker wants to achieve
1. Computationally secure: The cost of breaking the cipher exceeds the value of the encryptited information
2. Unconditionally secure: No matter how much time as opponent has, it is impossible for people decrypt. (secure against brute-force)
Assumptions:
- What is the computational limitation
- Always better to over-estimate the ability of your attackers
1. Computation: attacker might have many computing resource(super-computers)
2. Network: attacker might have control over the network/communication channel, they can send/drop/inject/view your packet
3. Some problems are hard(NP=/=P), no polynomial time solutions
4. We generally assume computation requiring 2^80 is unsolvable

Brute-force attack

Attackers try all possible sets of keys
By probability, it has to try at least half of them
We generally assume computation requiring 2^80 is unsolvable

Posted 2020-12-28Updated 2025-06-19Notesa minute read (About 179 words)

Algorithm - Dynamic Programming

What is Dynamic Programming?

Principle of Optimality (Bellman, 1957):
An optimal sequence of decisions has the property that whatever the initial state and decision are, the remaining decisions must constitute an optimal decision with regard to the state resulting from the first decision.

The Other Algorithmic Design Philosophy

Divide-and-Conquer:
The problem is divided and the subproblems are processed in a recursive manner, but the solutions of Divide-and-Conquer subproblems are usually not repeated, and when they are repeated, the same subproblems are usually recalculated.
Greedy Approach:
At each stage, starting from a certain starting point, each input is checked one by one to see if it is suitable to be added to the answer. If it is not possible to find a selection procedure to check one by one for the optimization problem to be handled, we will discuss it later.

Divide and Conquer and Dynamic Programming are very similar. The difference is that Dynamic Programming’s subproblems have many overlaps, which can be stored in a table without recalculation, exchanging space for time.

Posted 2020-12-26Updated 2025-06-19Notesa few seconds read (About 24 words)

Internetional Relationship - Tools and Target

Tools

Political Diplomacy
Economy and Trade
Military
Cultural Promotion
Spy

Target

Attitude
Weakening Enemy
Seize benefits
Eliminate the opponent

Posted 2020-12-21Updated 2025-06-19Notesa few seconds read (About 10 words)

Algorithm - Master Theorem for exam

Master Theorem for Exam

Posted 2020-12-18Updated 2025-06-19Notes3 minutes read (About 500 words)

Networking Notes 6 - Network Performance Measures

Network Performance Measures

Throughput
- How much data travels across the network
Average packet delay
- How long does it take for a packet to arrive at its destination
- How responsive is the system to user commands
- Can the network support real-delivery such as audio and video

Routing Algorithm

“Good” path:

The minimum cost path
Other definitions are possible

Hop count

The number of routers a packet must go through

Least-cost routing

Associate a cost with each link

Bandwidth
Delay
Reliability

To make routing decision:

Topology of the network
Traffic load
Link cost
Information should always keep up-to-date
More information

Fixed Routing

Central Directory is usually stored at a network control center
Matrix table
No different between Datagram and VC
Advantages
- Simplicity
- Work Well in a reliable network with a stable load
  (IPLC?)
Disadvantages
- Lack of flexibility
- Do not react to network congestion or failures

Flooding Algorithm

Packet sent to every neighbor node

No network information required
Evenually a number of copies will arrive at destination
Guarantees the packet reaches the destination in the shortest time
Duplicate packets are generated.
Node can remember packets already forwarded
Advantage
- Very rebust
- Used for sending emergency messages
Disadvantage
- High traffic load

Random Algorithm

The outgoing link is chosen at random
Round-robin?

Advantages
- Robust and simple
- no network information
- Less traffic load compare to flooding
Disadvantages
- Longer path
- Performance not guaranteed

Adaptive Routing

Used by almost all packet switching network
** routing decision changes as network conditions change**

Failure Link
Network congestion
Require network information
Network information must be exchanged among the nodes
Decisions are more complex
Advantages
- Impoved performance
- Congestion control
Disadvantages
- Substantial processing burden on nodes
- Increased network traffic traffic due to the exchange of network information
- Reacting too quickly can cause oscillation

Distance Vector Routing

Each node communicates only with directly-attached neighborsA routing table is created by building up a common set of routers in close proximity to each other, hence the term “distributed”.
Each router on the network must maintain a two-dimensional Vector Table, which records the best known distance from its own router to each router.
A router periodically builds a routing table by exchanging vector tables with neighboring routers (not broadcast to all routers).
When a router receives a vector table from its neighbors and then corrects its own vector table, the contents of the vector table are continually corrected and retransmitted, and the entire network state is gradually passed to each router.
As the routing tables on the routers become more complete, it becomes possible to find the best path. The vector table is sent only to the neighboring routers, which consumes less bandwidth and does not cause broadcast oscillation.

Link State Routing

Information sharing - each router shares its knowledge of its neighbourhood with all routers in the network.
After all LASs from all nodes are gathered, the entire map of the network can be constructed.

Least-Cost Algorithms

Dijkstra’s Algorithm
Bellman - Ford Algorithm

We Will come back for the Algorithm later

Posted 2020-12-18Updated 2025-06-19Notes2 minutes read (About 346 words)

Networking Notes 5 - Switching

Pactket-Switched Networks: Routing

Two Main approaches to implement packet switching

Datagram approach
Virtual circuit approach

Datagram Network

connection-less network
destination address determines next hop
routes may changee during session

Datagram Approach

Connectionless switching
Each packet(datagram) is treateed indeependently
Packets may go by different paths across the network
May arrive out of order, the transport layer neeeds to reorder them

Virtual circuit network

Connection-oriented switching
Call request and call accept packets are used to establish the connection between sender and receiver
Each packet carries tag(virtual circuit ID), tag determines next hop
fixed path determined at call setup time, remains fixed thru call
routers maintain per-call state
Path is fixed

Virtual Circuit approach

All packets of the same message are transferred via a preplanned route(same route)
Arrive in order
Eatablish between sender and receiver at the beginning

There are two main virtual circuit

Permanent virtual circuit(PVC)

Set up by the network provider
No need to terminate the VC after transmission, and no need to set up the VC before it(because it is already seted up by network provider)

Switch virtual circuit

Setup everytime when a VC is needed, and terminate after the transmission
May get different VC according to network conditions
More flexible, but required extra set up time before data transfer begins

Connection establishment
Data transfer
Connection release

VC vs Datagram

VC

sequencing and error control
No routing decisions to make -> Packets are forwarded more quickly
Less reliable

Datagram

No call setup phase - Good for bursty data
More flexible
- can have alternate route
- can avoid congested parts of the network

Switching

Switching in computer network helps in deciding the best route for data transmission if there are multiple paths in a larger network
One-to-One connection

Packet Switching

The internet is a packet switched network
Message is broken into individual (Packets)
Each packet is sent individually
Each packet will have source and destination IP address with sequence number.
sequence number will help receiver
- Reorder the packets
- Detect missing packets
- Send acknowledgments

Posted 2020-12-18Updated 2025-06-19Notes2 minutes read (About 333 words)

Networking Notes 4 - Network support group

Network Layer

Network layer is concerned with getting packets from source all the way to destinations.

It is the lowest layer that deals with end-to-end transmission, and may make many hops at intermediate nodes.

Physical and data link layers only operate locally.

Hop - is one portion of the path between source and destination.

Network-Layer Function

Connect heterogeneous networks together to look like a single network (Internetworking)
Uniquely identify each device (addressing)
Make decision to deliver data (routing)
Fit data to size used by the lower layer protocol (fragmentation)

Jobs for sending side: encapsulates segments into datagrams

Jobs for receiving side: delivers segments to transport layer

Address in TCP

Physical address

Logical address

Port address

Physical Address

Address of a node as defined by its LAN or WAN

In a data link level, frame transfer only contains Physical addresses in the header.

Most LAN use a 48-bit(6bytes) physical address written as 12 hexadecimal digits with every digits separated by a colon.

Physical Address also called Media Access Control (MAC) address

MAC address is performed via the Address Resolution Protocol (ARP)

Logical Address

Universal address that can pass through the LAN boundaries.

Address of a LAN. Logical Address + Physical Address needed to send data to another LAN network.

Port Address

Computer can run multiple processes at the same time, but need to label different processes by using port address

Port address is 16-bit address. (2^16) 0 - 65535, 65536 in total.

DNS

Domain Name System

Translation of a hostname into an IP address.

Posted 2020-12-18Updated 2025-06-19Notesa minute read (About 176 words)

Networking Notes 3 - Internet Transport-Layer Protocol

More than one transport-later protocol available to applications

Internet: TCP and UDP

TCP

Reliable, in-order delivery
Connection setup
Flow control
Congestion control

UDP

Unreliable, unordered delivery

More about TCP

Full name: Transmission Control Protocol

Connection-oriented protocol
Provide a reliable unicast end-to-end byte stream over an unreliable internetwork

Before any data transfer, TCP establishes a connection(hand shark?)

Client: send Request a connection
Server: Accept a connection
Do Data Transfer
Client: Disconnect

Reliable

The reason why we use TCP is because it is reliable.
Byte stream is broken up into chunks - segments
The receiver sends acknowledgements (ACKs) for segments.
TCP maintains a timer, if ACK is not received in time, the segment is retransmitted.
Detecting errors (checksums for header and data)
Each byte has a sequence number

UDP

Full name : User Datagram Protocol

Connectionless - no handshark
Unreliable
No flow control
No congestion control

Good:

No connect establishment ← can add delay
simple
small segment header
no congestion control: as fast as desired

Good for: Streaming multimedia application

Loss tolerant
rate sensitive
DNS