Data Mining Notes 1

Attribute of data

  • Nominal (ID numbers)
  • Ordinal (grades)
  • Interval (dates)
  • Ratio

How to measure the similarity of two objects

Similarity = 1 - Dissimilarity

  • Similarity
    • Numerical measure of how alike two data objects are
    • Is higher when objectsare more alike
    • Often falls in the range [0,1]
  • Dissimilarity
    • Numberical measure of how different re two data objects
    • Lower when objects are more alike
    • Miinimum Dissimilarity is often 0
    • Upper limit varies

Proximity refers to a similarity or dissimilarity

Data Quality

Examples of problems:

  • Noise and outliers
  • Missing Values
  • Duplicate data

Duplicate Data

Data set may include data objects that are duplicates, or almost duplicates of one another

  • Major issue when merging data from heterogeous sources

Example: Same person with multiple email addresses
Data Cleaning: Process of dealing with duplicate data issues

Data Preprocessing

  • Aggregation: Conbining two or more attributes
    • Data reduction
    • Change of scale
    • More “stable” data
  • Sampling
  • Dimensionality Reduction
  • Feature subset selection
  • Discretization
  • Attribute Transformation

Sampling

  • Sampling is the main technique employed for data selection
  • Statisticians sample because obtaining the entire set of data interest is too expensive or time consuming
  • Sampling is used in data mining because processing the entire set of data of interest is too expensive or time consuming

Key principle for effective sampling is

  1. using a sample will work almost as well as using the entire data sets, if the sample is representative.
  2. A sample is representative if it has approximately the same property as the original set of data

Types of sampling

  • Simple Random Sampling
  • Sampling without replacement
  • Sampling with replacement
  • Stratified sampling

Dimensionality Reduction

???? Need to learn later

Distributed and Cloud Computing Notes 1

Distributed and Cloud Computing Notes 1

Reasons for Distributed Systems

  • Functional Separation
    • Different Capabilities and purposes
  • Inherent Distribution
    • Information
    • People
  • Power imbalance and load variation
  • Reliability
  • Economies

Consequences of Distributed Systems

  • Concurrency - Each computer is autonomous
    • Carry our tasks independently
    • Tasks coordinate their actions by exchanging messages
    • System capacity can be increased by adding more resources
  • No global clock
  • Independent Failures

Motivation of Distributed Systems

  • To share resource and information
  • The emergence of pervasive networking technology
  • The emergence of mobile and ubiquitous computing
  • The increasing demand for multimedia services
  • The view of distributed systems as a utility

Maintenance of intranet

  • No rick if no connection to internet
  • Firewalls are used to limit services from/to an intranet
    • Limit FTP/Remote Desktop etc.

Mobile computing: Performing computing tasks while the user is on the move, away from his/her usual environment

Eight forms of transparency

  • Access transparency
  • Location transparency
  • Concurrency transparency
  • Replication transparency
  • Failure transparency
  • Mobility transparency
  • Performance transparency
  • Scaling transparency

List of Challenge

  • Heterogeneity
  • Security
    • Confidentiality
      • Protection against disclosure to unauthorized individual information
    • Integrity
      • Protection against alteration or corruption
    • Availability
      • Protection against interference targeting access to the resources
      • DDoS
    • Authenticity or Non-repudiation
      • Proof of sending / receiving an information
      • digital signature

Failure

Availability =MTTF/(MTTF+MTTR)

  • Mean time to failure(MTTF)
    • The average time of normal operation before the system fails
  • Mean time to repair (MTTR)
    • The average time it takes to repair the system and restore it to working condition

Single point failure

Single hardware/Software component failures cause the whole system crash. The key approach to enhancing availability is to make as many as possible partial failures by removing single points of failure

Checkpointing

  • The process of periodically saving the stage of an executing program to stable storage, from which the system can recover after a failure.
    • Each program stae saved is called a Checkpoint .
    • Checkpointing can be realized by operating system at kernel level/Third party library/by the application itself.

Jobs

  • Serial Jobs: Run on a single node
  • Parallel jobs: use multiple nodes
  • Interactive jobs: require fast turnaround time, and their input/output is directed to a termainal
  • Batch jobs: need more resources and don’t need immediate responses. Scheduled jobs.

job Management System

  • A user server: Let user submit jobs.
  • A job scheduler: performs job scheduling
  • A resource manager: allocates and monitors resources. Enforces scheduling policies, and collects accounting information.

Security Mechanisms

  • Encryption(AES, RSA)
  • Authentication(Password, Public key)
  • Authorization(access control)

  • Concurrency
    • Fair scheduling
    • Preserve dependencies
    • Avoid deadlocks
    • Object locking, data consistency, semaphores
  • Fault tolerance (No failure despite faults)
    • Fault detection
      • Checksums
      • Heartbeat
    • Fault masking
      • Retransmission of corrupted messages
      • Redundancy
    • Fault toleration
      • Exception handling
      • Timeouts
    • Fault recovery
      • Rollback mechanisms
  • Scalability
  • Openness
  • Distribution transparency <= Do not let other touch

Information Security Notes 1 - Classical Crypto System

Information Security Notes 1 - Classical Crypto System

What is Encryption

Encryption is composed of a key and an Encryption algorithm.

  1. Type of operations used for transforming plaintext to ciphertext
  2. The number of keys used
  3. The way in which the plaintext is processed

Encryption and Decryption

  • Unencrypted message = plaintext/message
  • Encrypted message = cipher/ciphertext

Cryptanalysis

Means attack

  • Brute-force
    • Tries every possible
  • Breaking the algorithm
    • Tries to exploit the weakness of the encryption algorithm

How to measured by the following dimensions

  • Attacker models
    • How strong is the attacker
    1. Ciphertext only attacks
    2. Known plaintext attacks
    3. Chosen plaintext attacks(Attacker can choose plaintext on his own)
    4. Chosen ciphertext attacks(Attacker can choose the cipher and obtain the plaintext)
  • Security Goal
    • What Goals does your attacker wants to achieve
    1. Computationally secure: The cost of breaking the cipher exceeds the value of the encryptited information
    2. Unconditionally secure: No matter how much time as opponent has, it is impossible for people decrypt. (secure against brute-force)
  • Assumptions:
    • What is the computational limitation
    • Always better to over-estimate the ability of your attackers
    1. Computation: attacker might have many computing resource(super-computers)
    2. Network: attacker might have control over the network/communication channel, they can send/drop/inject/view your packet
    3. Some problems are hard(NP=/=P), no polynomial time solutions
    4. We generally assume computation requiring 2^80 is unsolvable

Brute-force attack

  • Attackers try all possible sets of keys
  • By probability, it has to try at least half of them
  • We generally assume computation requiring 2^80 is unsolvable

Algorithm - Dynamic Programming

Algorithm - Dynamic Programming

What is Dynamic Programming?

Principle of Optimality (Bellman, 1957):
An optimal sequence of decisions has the property that whatever the initial state and decision are, the remaining decisions must constitute an optimal decision with regard to the state resulting from the first decision.

The Other Algorithmic Design Philosophy

  1. Divide-and-Conquer:
    The problem is divided and the subproblems are processed in a recursive manner, but the solutions of Divide-and-Conquer subproblems are usually not repeated, and when they are repeated, the same subproblems are usually recalculated.

  2. Greedy Approach:
    At each stage, starting from a certain starting point, each input is checked one by one to see if it is suitable to be added to the answer. If it is not possible to find a selection procedure to check one by one for the optimization problem to be handled, we will discuss it later.

Divide and Conquer and Dynamic Programming are very similar. The difference is that Dynamic Programming’s subproblems have many overlaps, which can be stored in a table without recalculation, exchanging space for time.

Internetional Relationship - Tools and Target

Internetional Relationship - Tools and Target

Tools

  1. Political Diplomacy
  2. Economy and Trade
  3. Military
  4. Cultural Promotion
  5. Spy

Target

  1. Attitude
  2. Weakening Enemy
  3. Seize benefits
  4. Eliminate the opponent

Networking Notes 6 - Network Performance Measures

Network Performance Measures

  • Throughput
    • How much data travels across the network
  • Average packet delay
    • How long does it take for a packet to arrive at its destination
    • How responsive is the system to user commands
    • Can the network support real-delivery such as audio and video

Routing Algorithm

“Good” path:

  • The minimum cost path
  • Other definitions are possible

Hop count

The number of routers a packet must go through

Least-cost routing

Associate a cost with each link

  • Bandwidth
  • Delay
  • Reliability

To make routing decision:

  • Topology of the network

  • Traffic load

  • Link cost

  • Information should always keep up-to-date

  • More information

Fixed Routing

  • Central Directory is usually stored at a network control center

  • Matrix table

  • No different between Datagram and VC

  • Advantages

    • Simplicity
    • Work Well in a reliable network with a stable load
      (IPLC?)
  • Disadvantages

    • Lack of flexibility
    • Do not react to network congestion or failures

Flooding Algorithm

Packet sent to every neighbor node

  • No network information required

  • Evenually a number of copies will arrive at destination

  • Guarantees the packet reaches the destination in the shortest time

  • Duplicate packets are generated.

  • Node can remember packets already forwarded

  • Advantage

    • Very rebust
    • Used for sending emergency messages
  • Disadvantage

    • High traffic load

Random Algorithm

The outgoing link is chosen at random
Round-robin?

  • Advantages
    • Robust and simple
    • no network information
    • Less traffic load compare to flooding
  • Disadvantages
    • Longer path
    • Performance not guaranteed

Adaptive Routing

Used by almost all packet switching network
** routing decision changes as network conditions change**

  • Failure Link

  • Network congestion

  • Require network information

  • Network information must be exchanged among the nodes

  • Decisions are more complex

  • Advantages

    • Impoved performance
    • Congestion control
  • Disadvantages

    • Substantial processing burden on nodes
    • Increased network traffic traffic due to the exchange of network information
    • Reacting too quickly can cause oscillation

Distance Vector Routing

Each node communicates only with directly-attached neighborsA routing table is created by building up a common set of routers in close proximity to each other, hence the term “distributed”.
Each router on the network must maintain a two-dimensional Vector Table, which records the best known distance from its own router to each router.
A router periodically builds a routing table by exchanging vector tables with neighboring routers (not broadcast to all routers).
When a router receives a vector table from its neighbors and then corrects its own vector table, the contents of the vector table are continually corrected and retransmitted, and the entire network state is gradually passed to each router.
As the routing tables on the routers become more complete, it becomes possible to find the best path. The vector table is sent only to the neighboring routers, which consumes less bandwidth and does not cause broadcast oscillation.

Information sharing - each router shares its knowledge of its neighbourhood with all routers in the network.
After all LASs from all nodes are gathered, the entire map of the network can be constructed.

Least-Cost Algorithms

  • Dijkstra’s Algorithm
  • Bellman - Ford Algorithm

We Will come back for the Algorithm later

Networking Notes 5 - Switching

Pactket-Switched Networks: Routing

Two Main approaches to implement packet switching

  • Datagram approach
  • Virtual circuit approach

Datagram Network

  • connection-less network
  • destination address determines next hop
  • routes may changee during session

Datagram Approach

  • Connectionless switching
  • Each packet(datagram) is treateed indeependently
  • Packets may go by different paths across the network
  • May arrive out of order, the transport layer neeeds to reorder them

Virtual circuit network

  • Connection-oriented switching
  • Call request and call accept packets are used to establish the connection between sender and receiver
  • Each packet carries tag(virtual circuit ID), tag determines next hop
  • fixed path determined at call setup time, remains fixed thru call
  • routers maintain per-call state
  • Path is fixed

Virtual Circuit approach

  • All packets of the same message are transferred via a preplanned route(same route)
  • Arrive in order
  • Eatablish between sender and receiver at the beginning

There are two main virtual circuit

Permanent virtual circuit(PVC)

  • Set up by the network provider
  • No need to terminate the VC after transmission, and no need to set up the VC before it(because it is already seted up by network provider)

Switch virtual circuit

  • Setup everytime when a VC is needed, and terminate after the transmission
  • May get different VC according to network conditions
  • More flexible, but required extra set up time before data transfer begins
  1. Connection establishment
  2. Data transfer
  3. Connection release

VC vs Datagram

VC

  • sequencing and error control
  • No routing decisions to make -> Packets are forwarded more quickly
  • Less reliable

Datagram

  • No call setup phase - Good for bursty data
  • More flexible
    • can have alternate route
    • can avoid congested parts of the network

Switching

Switching in computer network helps in deciding the best route for data transmission if there are multiple paths in a larger network
One-to-One connection

Packet Switching

  • The internet is a packet switched network
  • Message is broken into individual (Packets)
  • Each packet is sent individually
  • Each packet will have source and destination IP address with sequence number.
  • sequence number will help receiver
    • Reorder the packets
    • Detect missing packets
    • Send acknowledgments

Networking Notes 4 - Network support group

Network Layer

Network layer is concerned with getting packets from source all the way to destinations.

It is the lowest layer that deals with end-to-end transmission, and may make many hops at intermediate nodes.

Physical and data link layers only operate locally.

Hop - is one portion of the path between source and destination.

Network-Layer Function

  • Connect heterogeneous networks together to look like a single network (Internetworking)
  • Uniquely identify each device (addressing)
  • Make decision to deliver data (routing)
  • Fit data to size used by the lower layer protocol (fragmentation)

Jobs for sending side: encapsulates segments into datagrams

Jobs for receiving side: delivers segments to transport layer

Address in TCP

Physical address

Logical address

Port address

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9f369c2b-6bed-4623-870e-6c4ab770c8b0/Untitled.png

Physical Address

Address of a node as defined by its LAN or WAN

In a data link level, frame transfer only contains Physical addresses in the header.

Most LAN use a 48-bit(6bytes) physical address written as 12 hexadecimal digits with every digits separated by a colon.

https://s3-us-west-2.amazonaws.com/secure.notion-static.com/9e142654-d8cb-466b-bea5-0d313b5a978a/Untitled.png

Physical Address also called Media Access Control (MAC) address

MAC address is performed via the Address Resolution Protocol (ARP)

Logical Address

Universal address that can pass through the LAN boundaries.

Address of a LAN. Logical Address + Physical Address needed to send data to another LAN network.

Port Address

Computer can run multiple processes at the same time, but need to label different processes by using port address

Port address is 16-bit address. (2^16) 0 - 65535, 65536 in total.

DNS

Domain Name System

Translation of a hostname into an IP address.

Networking Notes 3 - Internet Transport-Layer Protocol

More than one transport-later protocol available to applications

Internet: TCP and UDP

TCP

  • Reliable, in-order delivery
  • Connection setup
  • Flow control
  • Congestion control

UDP

  • Unreliable, unordered delivery

More about TCP

Full name: Transmission Control Protocol

  • Connection-oriented protocol
  • Provide a reliable unicast end-to-end byte stream over an unreliable internetwork

Before any data transfer, TCP establishes a connection(hand shark?)

  1. Client: send Request a connection
  2. Server: Accept a connection
  3. Do Data Transfer
  4. Client: Disconnect

Reliable

  • The reason why we use TCP is because it is reliable.
  • Byte stream is broken up into chunks - segments
  • The receiver sends acknowledgements (ACKs) for segments.
  • TCP maintains a timer, if ACK is not received in time, the segment is retransmitted.
  • Detecting errors (checksums for header and data)
  • Each byte has a sequence number

UDP

Full name : User Datagram Protocol

  • Connectionless - no handshark
  • Unreliable
  • No flow control
  • No congestion control

Good:

  • No connect establishment ← can add delay
  • simple
  • small segment header
  • no congestion control: as fast as desired

Good for: Streaming multimedia application

  • Loss tolerant
  • rate sensitive
  • DNS