Cloud Notes of Technical Issues in Distributed System

  1. Time Synchronization
  2. Coordination and agreement
  3. Transactions and concurrency control

Time synchronization

Accurate timing is important.

Computers each have their own physical clocks

Because of structural differences between servers, their clocks drift by different amounts over time, so the physical clocks of different servers disagree to some extent. As a direct result, event A may occur later than event B and still carry a timestamp smaller than B's. If state is synchronised by timestamp, B's data will then overwrite A's data even though A's is newer, which we don't want to see.

  • Electronic devices that count oscillations occurring in a crystal at a definite frequency.

  • Operating System reads the hardware clock value.

  • Not perfect

    • Clock skew: the instantaneous difference between the readings of any two clocks
    • Clock drift: different crystal-based clocks count time at different rates
      • Temperature matters
      • Drift rate: The change in the offset between the clock and a nominal perfect reference clock per unit of time

External synchronization

Synchronize a group of clocks with an authoritative external source of time
For example, UTC: Coordinated Universal Time
Network Time Protocol(NTP)

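The process sets its clock to t + T_round/2, where t is the time returned by the server and T_round is the measured round-trip time of the request (Cristian's method).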
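A minimal sketch of the t + T_round/2 estimate, assuming the reply takes about half of the round trip. The names cristian_estimate and request_time_server are hypothetical; request_time_server stands in for whatever call fetches the server's time.

```python
import time


def cristian_estimate(request_time_server, clock=time.monotonic):
    """Estimate the server's current time from one request/reply exchange.

    request_time_server: hypothetical callable returning the server time t.
    clock: local clock used only to measure the round trip.
    """
    t0 = clock()                      # local time when the request is sent
    t = request_time_server()         # server time t carried in the reply
    t1 = clock()                      # local time when the reply arrives
    t_round = t1 - t0                 # measured round-trip time T_round
    return t + t_round / 2, t_round   # assume the reply took half the round trip
```

The shorter the round trip, the smaller the possible error, so a client may repeat the request and keep the estimate with the smallest T_round.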

Internal synchronization

Synchronize the clocks within a group of computers with one another. A coordinator computer is chosen to be the master; the other computers are slaves. The master periodically polls the slaves, and the slaves send back their clock values.

  • Berkeley Algorithm
  • Cristian’s Method (a client queries a time server, so it is more commonly classed as external synchronization)
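The polling scheme described above is the Berkeley algorithm: the master averages the clock values it collects (including its own) and tells each machine how much to adjust. A rough sketch, assuming the readings are already collected and ignoring round-trip compensation and outlier rejection; berkeley_adjustments is a hypothetical name.

```python
def berkeley_adjustments(master_clock, slave_clocks):
    """Return the adjustment each machine should apply, master first.

    A positive value means "advance your clock", a negative value means
    "slow down or set back".
    """
    readings = [master_clock] + list(slave_clocks)
    average = sum(readings) / len(readings)
    # Sending a relative adjustment (not the average itself) keeps the
    # result independent of how long the message takes to arrive.
    return [average - r for r in readings]
```

For example, with clocks reading 180, 205 and 170 minutes, the average is 185, so the adjustments are +5, -20 and +15 minutes.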

Distributed Mutual Exclusion

  1. Safety - at most one process can execute in the critical section at a time
  2. Liveness - requests to enter and exit the critical section eventually succeed (freedom from deadlock and starvation)
  3. Ordering - if one request to enter happened before another, entry to the critical section is granted in that order.

Evaluated by:

  1. Consumed bandwidth
    • Two messages are required to enter the critical section (a request message and a grant message)
    • One message is required to exit the critical section (a release message)
  2. Client delay
    • Round-trip delay
  3. Throughput (synchronization delay)
    • The time for a release message to reach the server plus a grant message to reach the next process.
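The message counts above match a central-server design, where one server grants entry to the critical section. The following is a minimal single-process simulation of that idea, not a networked implementation; LockServer and its methods are hypothetical names.

```python
from collections import deque


class LockServer:
    """Hypothetical central server that lets one process in at a time."""

    def __init__(self):
        self.holder = None       # process currently in the critical section
        self.waiting = deque()   # queued requests, oldest first
        self.messages = 0        # total messages exchanged so far

    def request(self, pid):
        self.messages += 1               # request message
        if self.holder is None:
            self._grant(pid)
        else:
            self.waiting.append(pid)     # grant is deferred until a release

    def release(self, pid):
        assert pid == self.holder, "only the holder may release"
        self.messages += 1               # release message
        self.holder = None
        if self.waiting:
            self._grant(self.waiting.popleft())

    def _grant(self, pid):
        self.messages += 1               # grant message
        self.holder = pid
```

Each entry costs a request and a grant, and each exit costs a release, so two processes that each enter once exchange six messages in total.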

Coordination and agreement

Transactions and concurrency control

Motivation of Synchronization

  1. Recoverable to handle process crash
  2. Multiple clients access the same object concurrently
  3. Atomic operation

Atomic Transactions ("atomic and indivisible")

  1. All or nothing
    • either completes successfully
    • or has no effect at all
  2. Isolation
    • Each transaction must be performed without interference from other transactions
    • The intermediate effects of a transaction must not be visible to other transactions

Concurrency Control

  1. Lost update
    • Two transactions read the same old value and each uses it to calculate a new value, so one update overwrites the other
  2. inconsistent retrievals
    • Transaction observes values that are involved in an ongoing updating transaction
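The lost update problem can be reproduced with one fixed interleaving of two transactions. This is an illustrative sketch; the account name, the amounts, and the transaction names T and U are made up.

```python
balance = {"b": 200}

# Transaction T reads, then transaction U reads the same old value.
t_read = balance["b"]
u_read = balance["b"]

# Each transaction computes its new value from what it read earlier.
balance["b"] = t_read + 20    # T writes 220
balance["b"] = u_read + 30    # U writes 230, overwriting T's update

lost_update_result = balance["b"]   # 230: T's +20 has been lost
serial_result = 200 + 20 + 30       # 250: what either serial order would give
```

Any serial execution (T then U, or U then T) ends at 250, so the interleaved result of 230 is not serially equivalent.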

Rules of Serial Equivalence

For two transactions to be serially equivalent, all pairs of conflicting operations of the two transactions must be executed in the same order at every object they both access.
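A small checker can make this rule concrete. The sketch below assumes a schedule is a list of (transaction, operation, object) tuples for two transactions, with "r" for read and "w" for write; two operations conflict when they come from different transactions, touch the same object, and at least one is a write. The function name serially_equivalent is hypothetical.

```python
def serially_equivalent(schedule):
    """Check that every conflicting pair is ordered the same way.

    schedule: list of (txn, op, obj) with op "r" or "w", two transactions.
    """
    orders = set()
    for i, (t1, op1, obj1) in enumerate(schedule):
        for t2, op2, obj2 in schedule[i + 1:]:
            conflicting = t1 != t2 and obj1 == obj2 and "w" in (op1, op2)
            if conflicting:
                orders.add((t1, t2))   # t1's operation came first
    # With two transactions, at most one ordering direction may appear.
    return len(orders) <= 1
```

For instance, the interleaving T reads a, U reads a, T writes a, U writes a fails the check, because T's read precedes U's write while U's read precedes T's write.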

FIFO?

Locking

  • Exclusive lock - Pessimistic Lock
    Only one transaction can access the object at a time.
    It assumes that concurrency conflicts will occur and blocks any operation that may violate data integrity.

Java synchronized is an implementation of pessimistic locking, where every time a thread wants to modify data it first obtains a lock, ensuring that only one thread can manipulate the data at any one time, while the others are blocked.

  • Optimistic Lock
    Timestamp/version
    When the update is committed, check the timestamp of the data currently in the database and compare it with the timestamp you read before the update. If they are the same, the commit is OK; otherwise there is a version conflict.

  • Two-phase locking

  • Deadlock

    • Detection:
      • Find cycles in the wait-for graph
      • Select a transaction for abortion to break the cycle
    • Timeout
  • Read/Write Locks

    • Acquire a read lock before performing a read operation
    • Acquire a write lock before performing a write operation
    • A write lock is exclusive, while read locks can be shared by several transactions
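The deadlock detection step listed above (finding cycles in the wait-for graph) can be sketched with a depth-first search. This assumes the graph is given as a dictionary mapping each transaction to the transactions it is waiting for; find_cycle is a hypothetical helper name.

```python
def find_cycle(wait_for):
    """Return one cycle in the wait-for graph as a list, or None.

    wait_for: dict mapping a transaction to the transactions it waits for.
    """
    visiting, done = [], set()   # current DFS path, fully explored nodes

    def visit(txn):
        if txn in visiting:
            return visiting[visiting.index(txn):]   # path loops back: cycle
        if txn in done:
            return None
        visiting.append(txn)
        for other in wait_for.get(txn, ()):
            cycle = visit(other)
            if cycle:
                return cycle
        visiting.pop()
        done.add(txn)
        return None

    for txn in wait_for:
        cycle = visit(txn)
        if cycle:
            return cycle
    return None
```

Once a cycle is found, one transaction in it is chosen and aborted to break the deadlock; which one to pick (for example the youngest) is a policy decision that this sketch leaves open.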

Optimistic concurrency control

Checks for conflicting operations before a transaction commits.
If a conflict is found, the transaction is aborted and the client may restart it.
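One common way to perform this check is the version comparison described under Optimistic Lock: remember the version that was read and refuse the commit if it has changed. A minimal in-memory sketch; VersionedStore and VersionConflict are hypothetical names, and a real store would need to make the check and the write a single atomic step.

```python
class VersionConflict(Exception):
    """Raised when the data changed since the transaction read it."""


class VersionedStore:
    """Hypothetical store where every key carries a version number."""

    def __init__(self):
        self._data = {}   # key -> (value, version)

    def read(self, key):
        # Unknown keys read as (None, 0) so a first commit uses version 0.
        return self._data.get(key, (None, 0))

    def commit(self, key, new_value, version_read):
        _, current_version = self.read(key)
        if current_version != version_read:
            raise VersionConflict(key)   # someone else committed first
        self._data[key] = (new_value, current_version + 1)
```

A client that catches VersionConflict can re-read the value and retry, which corresponds to restarting the aborted transaction.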

Timestamp ordering

Record the most recent time of reading and writing of each object
Compare timestamps to determine whether an operation can be done immediately, must be delayed, or must be rejected.
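A simplified sketch of these checks follows. It covers only the accept and reject cases of basic timestamp ordering; the "delayed" case, which needs tentative versions and waiting, is deliberately left out. TimestampedObject is a hypothetical class name.

```python
class TimestampedObject:
    """Hypothetical object that records its latest read and write timestamps."""

    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0    # largest transaction timestamp that has read it
        self.write_ts = 0   # timestamp of the transaction that last wrote it

    def read(self, ts):
        if ts < self.write_ts:
            return False, None    # a later transaction already wrote: reject
        self.read_ts = max(self.read_ts, ts)
        return True, self.value

    def write(self, ts, value):
        if ts < self.read_ts or ts < self.write_ts:
            return False          # a later transaction already read or wrote
        self.value, self.write_ts = value, ts
        return True
```

A rejected operation means the transaction is too old relative to what has already happened on the object, so it is aborted and may be restarted with a new timestamp.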

Clusters

Benefits of computer clusters include

  1. Scalable performance
  2. High availability
  3. Fault tolerance
  4. Modular growth
  5. Use of commodity components

Attributes of Computer Clusters

  • Scalability
  • Packaging
    • Compact packaging: closely packaged in racks
    • Slack packaging: Located in different locations
  • Control
    • Centralized
    • Decentralized
  • Homogeneity
    • Homogeneous cluster: nodes from the same platform
    • Heterogeneous cluster: nodes from different platforms

Architecture

  • The OS should be multiuser, multitasking, and multithreaded
  • interconnected by fast commodity networks
  • Cluster middleware glues together all node platforms at the user space

Design principles of Clusters

  • Single-System image (SSI)
    The same client will see the same view of the service no matter which machine in the cluster it connects to.
  • Reliability
    • operate without a breakdown
  • Availability
    • percentage of time available to the user
  • Serviceability
    • maintenance/repair/upgrades etc.

Operate-Repair cycle

  • Mean time to failure
    • average time of normal operation before the system fails
  • Mean time to repair
    • average time to fix(restore)

Type of Failures

  1. Unplanned failures vs. planned shutdowns
  2. Transient failures vs. permanent failures
    • A transient failure can be fixed by a reboot
  3. Partial failures vs. total failures
    • A partial failure affects only part of the system, so the cluster is still usable

Fault-Tolerant

  • Hot standby
    Only the primary nodes are actively doing the useful work
    Standby nodes are powered on and running some monitoring programs
  • Active-takeover
    All servers are primary and doing useful work.
    Users may experience some delays or may lose some data
  • Failover
    When a component fails, it allows the remaining system to take over the services

Failure Cost Analysis

  • MTTF, MTTR
  • Availability (%)
  • The downtime per year(hours)
  • The yearly failure cost
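The quantities in this list can be derived from one another using Availability = MTTF / (MTTF + MTTR). A small sketch, assuming a 365-day year and a flat cost per hour of downtime; the function names and the cost model are illustrative assumptions.

```python
HOURS_PER_YEAR = 365 * 24   # 8760 hours, ignoring leap years


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    """Expected hours of downtime in one year."""
    return (1 - avail) * HOURS_PER_YEAR


def yearly_failure_cost(avail, cost_per_hour):
    """Downtime cost per year, assuming a flat hourly cost."""
    return downtime_hours_per_year(avail) * cost_per_hour
```

For example, MTTF = 990 hours and MTTR = 10 hours give an availability of 99%, which is about 87.6 hours of downtime per year.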

Distributed and Cloud Computing Notes 1

Reasons for Distributed Systems

  • Functional Separation
    • Different Capabilities and purposes
  • Inherent Distribution
    • Information
    • People
  • Power imbalance and load variation
  • Reliability
  • Economies

Consequences of Distributed Systems

  • Concurrency - Each computer is autonomous
    • Carry out tasks independently
    • Tasks coordinate their actions by exchanging messages
    • System capacity can be increased by adding more resources
  • No global clock
  • Independent Failures

Motivation of Distributed Systems

  • To share resource and information
  • The emergence of pervasive networking technology
  • The emergence of mobile and ubiquitous computing
  • The increasing demand for multimedia services
  • The view of distributed systems as a utility

Maintenance of intranet

  • No risk if there is no connection to the Internet
  • Firewalls are used to limit services from/to an intranet
    • Limit FTP/Remote Desktop etc.

Mobile computing: Performing computing tasks while the user is on the move, away from his/her usual environment

Eight forms of transparency

  • Access transparency
  • Location transparency
  • Concurrency transparency
  • Replication transparency
  • Failure transparency
  • Mobility transparency
  • Performance transparency
  • Scaling transparency

List of Challenges

  • Heterogeneity
  • Security
    • Confidentiality
      • Protection against disclosure of information to unauthorized individuals
    • Integrity
      • Protection against alteration or corruption
    • Availability
      • Protection against interference targeting access to the resources
      • DDoS
    • Authenticity or Non-repudiation
      • Proof that a piece of information was sent / received
      • digital signature

Failure

Availability = MTTF / (MTTF + MTTR)

  • Mean time to failure(MTTF)
    • The average time of normal operation before the system fails
  • Mean time to repair (MTTR)
    • The average time it takes to repair the system and restore it to working condition

Single point of failure

A single hardware or software component whose failure causes the whole system to crash. The key approach to enhancing availability is to turn as many failures as possible into partial failures by removing single points of failure.

Checkpointing

  • The process of periodically saving the state of an executing program to stable storage, from which the system can recover after a failure.
    • Each saved program state is called a checkpoint.
    • Checkpointing can be implemented by the operating system at kernel level, by a third-party library, or by the application itself.
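Application-level checkpointing can be sketched in a few lines. This example assumes the program state fits in a picklable Python object and that a local file counts as stable storage; save_checkpoint and load_checkpoint are hypothetical helper names.

```python
import os
import pickle
import tempfile


def save_checkpoint(state, path):
    """Write state so that a crash never leaves a half-written checkpoint."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # push the bytes to disk before renaming
    os.replace(tmp_path, path)     # atomically replace the old checkpoint


def load_checkpoint(path, default):
    """Return the last saved state, or default if no checkpoint exists."""
    if not os.path.exists(path):
        return default
    with open(path, "rb") as f:
        return pickle.load(f)
```

A long-running job would call load_checkpoint once at startup and save_checkpoint every so many steps; after a failure it resumes from the last saved step instead of from the beginning.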

Jobs

  • Serial Jobs: Run on a single node
  • Parallel jobs: use multiple nodes
  • Interactive jobs: require fast turnaround time, and their input/output is directed to a terminal
  • Batch jobs: need more resources and don’t need immediate responses. Scheduled jobs.

Job Management System

  • A user server: Let user submit jobs.
  • A job scheduler: performs job scheduling
  • A resource manager: allocates and monitors resources. Enforces scheduling policies, and collects accounting information.

Security Mechanisms

  • Encryption(AES, RSA)
  • Authentication(Password, Public key)
  • Authorization(access control)

  • Concurrency
    • Fair scheduling
    • Preserve dependencies
    • Avoid deadlocks
    • Object locking, data consistency, semaphores
  • Fault tolerance (No failure despite faults)
    • Fault detection
      • Checksums
      • Heartbeat
    • Fault masking
      • Retransmission of corrupted messages
      • Redundancy
    • Fault toleration
      • Exception handling
      • Timeouts
    • Fault recovery
      • Rollback mechanisms
  • Scalability
  • Openness
  • Distribution transparency: hide the distribution from users and applications
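Heartbeat-based fault detection, listed above under fault tolerance, can be sketched as follows. A node is only suspected, not known, to have failed when its heartbeat is overdue; HeartbeatMonitor and the fixed timeout are illustrative assumptions, and timestamps are passed in explicitly to keep the example deterministic.

```python
class HeartbeatMonitor:
    """Hypothetical detector that suspects nodes whose heartbeat is overdue."""

    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence tolerated
        self.last_seen = {}      # node -> time of its latest heartbeat

    def beat(self, node, now):
        """Record a heartbeat received from node at time now."""
        self.last_seen[node] = now

    def suspected(self, now):
        """Return the nodes that have been silent for longer than timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )
```

A slow network can delay heartbeats, so a suspected node may still be alive; the timeout trades detection speed against false suspicions.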