KLam's Tech life

Posted 2021-03-20 | Updated 2025-06-19 | Notes | 6 minutes read (About 884 words)

Cloud Notes of Technical Issues in Distributed System

Time Synchronization
Coordination and agreement
Transactions and concurrency control

Time synchronization

Accurate timing is important for ordering events correctly.

Computers each have their own physical clocks

Because of hardware differences between servers, their clocks drift at different rates, so after a while the physical clocks of different servers disagree to some extent. As a direct result, event A may occur later than event B and still carry a smaller timestamp than B. If state synchronisation is involved, B’s data will then overwrite A’s data, which we don’t want.

A physical clock is an electronic device that counts the oscillations of a crystal at a definite frequency.
The operating system reads the hardware clock value.
Not perfect
- Clock skew: the instantaneous difference between the readings of any two clocks
- Clock drift: different crystal-based clocks count time at different rates
  - Temperature matters
  - Drift rate: the change in the offset between the clock and a nominal perfect reference clock per unit of time
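
As a rough illustration of how a drift rate turns into skew, the sketch below assumes two hypothetical clocks with constant drift rates (the values are made up for the example) and computes how far apart they are after one day. The function names are illustrative only.

```python
def offset_after(drift_rate: float, elapsed_seconds: float) -> float:
    """Offset (in seconds) from a perfect clock, assuming a constant drift rate."""
    return drift_rate * elapsed_seconds


def skew(offset_a: float, offset_b: float) -> float:
    """Instantaneous difference between the readings of two clocks."""
    return abs(offset_a - offset_b)


DAY = 24 * 60 * 60  # seconds

# Assumed drift rates: clock A gains 1 microsecond per second,
# clock B loses 2 microseconds per second.
offset_a = offset_after(1e-6, DAY)
offset_b = offset_after(-2e-6, DAY)

print(f"skew after one day: {skew(offset_a, offset_b):.4f} s")
```

With these assumed rates the two clocks end up roughly a quarter of a second apart after a day, which is why periodic synchronization is needed.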

External synchronization

Synchronize a group of clocks with an authoritative external source of time
For example, UTC: Coordinated Universal Time
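Network Time Protocol (NTP)

A process sets its clock to t + T(round)/2, where t is the time value in the server's reply and T(round) is the measured round-trip time of the request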
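
The t + T(round)/2 estimate can be sketched as follows. This is a minimal sketch, not a real time client: `request_server_time` is a hypothetical callable standing in for the network request, and the estimate assumes the outbound and return delays are roughly equal.

```python
import time


def estimate_time(request_server_time, clock=time.monotonic):
    """Return (estimated_time, t_round) for one request to a time server.

    request_server_time: hypothetical callable that returns the server's
    clock value t carried in the reply.
    """
    sent = clock()
    t = request_server_time()
    received = clock()
    t_round = received - sent
    # Assume the reply spent half of the round trip in transit.
    return t + t_round / 2, t_round


if __name__ == "__main__":
    # Stand-in "server" that simply reports the local wall-clock time.
    estimate, t_round = estimate_time(time.time)
    print(estimate, t_round)
```

Passing the clock in as a parameter keeps the function easy to test with fake time values.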

Internal synchronization

Synchronize the clocks of a group of computers with one another. A coordinator computer is chosen to be the master; the other computers are slaves. The master periodically polls the slaves, and the slaves send back their clock values.

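Berkeley Algorithm: the master averages the clock values and tells each slave how much to adjust
Cristian’s Method: a client queries a time server and adds half the round-trip time (this one synchronizes against a server rather than within a group)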
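
The master/slave polling scheme can be sketched as an averaging step. This is a simplified sketch under stated assumptions: clock values are plain numbers (minutes here), round-trip compensation and the rejection of faulty readings that a fuller Berkeley implementation would perform are left out, and the names are illustrative.

```python
def berkeley_adjustments(master_clock, slave_clocks):
    """Return (master_adjustment, {slave_name: adjustment}).

    Each adjustment is the amount a machine should add to its own clock
    to reach the group average.
    """
    readings = [master_clock, *slave_clocks.values()]
    average = sum(readings) / len(readings)
    slave_adjustments = {
        name: average - value for name, value in slave_clocks.items()
    }
    return average - master_clock, slave_adjustments


# Example in minutes past midnight: master 3:00, slaves 3:25 and 2:50.
master_adj, slave_adj = berkeley_adjustments(180, {"A": 205, "B": 170})
print(master_adj, slave_adj)
```

Sending adjustments rather than absolute times means the network delay of the reply does not distort the result as much.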

Distributed Mutual Exclusion

Safety - at most one process can execute in the critical section at a time
Liveness - requests to enter and exit the critical section eventually succeed (freedom from deadlock and starvation)
Ordering - if one request to enter the critical section happened before another, entry is granted in that order

Evaluated by:

Consumed bandwidth (the figures below are for the central server algorithm)
- Two messages are required to enter the critical section (a request message and a grant message)
- One message is required to exit the critical section (a release message)
Client delay
- Round-trip delay
Throughput (synchronization delay)
- The time for a release message to reach the server plus a grant message to reach the next process
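
The message counts above match a central-server scheme, which can be sketched in a single process as follows. This is an illustrative sketch only: `LockServer` and its methods are hypothetical names, and real message passing is replaced by method calls with a counter.

```python
from collections import deque


class LockServer:
    """Central server that lets one process at a time into the critical section."""

    def __init__(self):
        self.holder = None
        self.waiting = deque()   # FIFO queue of processes waiting for a grant
        self.messages = 0        # messages exchanged so far

    def request(self, process):
        self.messages += 1       # request message
        if self.holder is None:
            self._grant(process)
        else:
            self.waiting.append(process)

    def release(self, process):
        if process != self.holder:
            raise ValueError(f"{process} does not hold the lock")
        self.messages += 1       # release message
        self.holder = None
        if self.waiting:
            self._grant(self.waiting.popleft())

    def _grant(self, process):
        self.messages += 1       # grant message
        self.holder = process


server = LockServer()
server.request("p1")   # request + grant: two messages to enter
server.request("p2")   # queued behind p1
server.release("p1")   # release, then a grant to p2
print(server.holder, server.messages)
```

The release followed by the grant to the next process is the synchronization delay described above.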

Coordination and agreement

Transactions and concurrency control

Motivation of Synchronization

Objects must be recoverable after a process crash
Multiple clients access the same object concurrently
Atomic operation

Atomic Transactions (“atomic and indivisible”)

All or nothing
- either completes successfully
- or has no effect at all
Isolation
- Each transaction must be performed without interference from other transactions
- Its intermediate effects must not be observable by other transactions

Concurrency Control

Lost update
- A transaction uses an old value to calculate a new value and overwrites another transaction's update
Inconsistent retrievals
- A transaction observes values that are involved in an ongoing updating transaction
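
The lost update problem can be reproduced with a fixed interleaving of two hypothetical transactions, T (adds 10) and U (adds 20). The sketch below is deterministic on purpose: it spells out the bad interleaving step by step instead of relying on real threads.

```python
def run_interleaved(initial):
    """Both transactions read before either writes, so one update is lost."""
    balance = initial
    t_seen = balance        # T reads the balance
    u_seen = balance        # U reads the same old value
    balance = u_seen + 20   # U writes its result
    balance = t_seen + 10   # T writes using its stale read; U's update is lost
    return balance


def run_serial(initial):
    """T runs to completion, then U: both updates survive."""
    balance = initial
    balance = balance + 10  # T
    balance = balance + 20  # U
    return balance


print(run_interleaved(100), run_serial(100))
```

The interleaved run ends 20 short of the serial run, which is exactly the update made by U.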

Rules of Serial Equivalence

All pairs of conflicting operations of the two transactions must be executed in the same order at every object they both access
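
A minimal sketch of this check for exactly two transactions is shown below. It assumes a schedule is given as a list of (transaction, operation, object) tuples in execution order, and that two operations conflict when they touch the same object and at least one is a write. With more than two transactions a cycle check on the serialization graph would be needed instead; that is outside this sketch.

```python
def conflicts(op_a, op_b):
    """Two operations on the same object conflict unless both are reads."""
    return "write" in (op_a, op_b)


def serially_equivalent(schedule):
    """Check the rule for a schedule containing two transactions."""
    orders = set()
    for i, (txn_a, op_a, obj_a) in enumerate(schedule):
        for txn_b, op_b, obj_b in schedule[i + 1:]:
            if txn_a != txn_b and obj_a == obj_b and conflicts(op_a, op_b):
                orders.add((txn_a, txn_b))   # txn_a's operation came first
    # Every conflicting pair must be ordered the same way round.
    return len(orders) <= 1


good = [("T", "read", "a"), ("T", "write", "a"),
        ("U", "read", "a"), ("U", "write", "a")]
bad = [("T", "read", "a"), ("U", "read", "a"),
       ("U", "write", "a"), ("T", "write", "a")]
print(serially_equivalent(good), serially_equivalent(bad))
```

The second schedule is the lost update interleaving: T reads before U writes, but U reads before T writes, so the conflicting pairs are ordered in both directions.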

FIFO?

Locking

Exclusive lock - Pessimistic Lock
Only one transaction can access the object at a time
Assumes that concurrency conflicts will occur and blocks any operation that may violate data integrity.

Java's synchronized keyword is an implementation of pessimistic locking: every time a thread wants to modify data it first obtains a lock, which ensures that only one thread can manipulate the data at any one time while the others are blocked.
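
The same idea can be sketched in Python, using `threading.Lock` as a rough analogue of a Java synchronized block (an analogy chosen for this example, not something the notes prescribe). The `Counter` class and `run` helper are illustrative names.

```python
import threading


class Counter:
    """Shared counter whose update is guarded by an exclusive lock."""

    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def increment(self):
        # Only one thread can hold the lock; the others block here.
        with self._lock:
            self.value += 1


def run(threads=4, per_thread=1000):
    counter = Counter()

    def work():
        for _ in range(per_thread):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    return counter.value


print(run())
```

Because every increment happens while holding the lock, no update is lost regardless of how the threads are scheduled.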

Optimistic Lock
Timestamp/version
When the update is committed, compare the current timestamp of the data in the database with the timestamp you read before the update. If they are the same, the update goes ahead; otherwise it is a version conflict.
Two-phase locking
Deadlock
- Detection:
  - Find cycles in the wait-for graph
  - Select a transaction to abort to break the cycle
- Timeout
Read/Write Locks
- Acquire a read lock before performing a read operation
- Acquire a write lock before performing a write operation
- A write lock is more exclusive than a read lock
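
The deadlock detection step listed above (find a cycle in the wait-for graph, then pick a transaction to abort) can be sketched with a depth-first search. The graph representation and the victim policy (abort the youngest transaction in the cycle) are assumptions made for this example; other policies are equally possible.

```python
def find_cycle(wait_for):
    """Return one cycle as a list of transactions, or None.

    wait_for maps each transaction to the set of transactions it waits for.
    """
    visiting, done = set(), set()
    path = []

    def visit(node):
        visiting.add(node)
        path.append(node)
        for nxt in wait_for.get(node, ()):
            if nxt in visiting:                 # back edge: cycle found
                return path[path.index(nxt):]
            if nxt not in done:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        visiting.discard(node)
        done.add(node)
        return None

    for node in list(wait_for):
        if node not in done:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


def choose_victim(cycle, start_times):
    """Assumed policy: abort the transaction that started most recently."""
    return max(cycle, key=lambda txn: start_times[txn])


graph = {"T": {"U"}, "U": {"V"}, "V": {"T"}, "W": {"T"}}
cycle = find_cycle(graph)
print(cycle, choose_victim(cycle, {"T": 1, "U": 2, "V": 3, "W": 4}))
```

W waits for T but is not itself part of the cycle, so it is never a candidate for abortion here.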

Optimistic concurrency control

Check for conflicting operations before commit
If a conflict is found, abort the transaction; the client may restart it
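
A minimal sketch of validate-at-commit with restart, using a version number per item. `Store`, `VersionConflict` and `update_with_retry` are hypothetical names, and the store is single-threaded; a real system would need the version check and the write to happen atomically.

```python
class VersionConflict(Exception):
    """Raised when the item changed after the transaction read it."""


class Store:
    def __init__(self):
        self._data = {}   # key -> (value, version)

    def read(self, key):
        return self._data.get(key, (None, 0))

    def commit(self, key, new_value, expected_version):
        _, current = self._data.get(key, (None, 0))
        if current != expected_version:
            raise VersionConflict(key)   # someone committed in between
        self._data[key] = (new_value, current + 1)


def update_with_retry(store, key, fn, attempts=3):
    """Read, compute, try to commit; restart on conflict."""
    for _ in range(attempts):
        value, version = store.read(key)
        try:
            store.commit(key, fn(value), version)
            return True
        except VersionConflict:
            continue
    return False


store = Store()
store.commit("x", 1, expected_version=0)
print(update_with_retry(store, "x", lambda v: v + 1), store.read("x"))
```

No lock is held while the new value is computed; the cost is paid only when a conflict is actually detected.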

Timestamp ordering

Record the most recent read time and write time of each object
Compare the transaction's timestamp with them to determine whether the operation can be done immediately, must be delayed, or must be rejected.
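
The comparison can be sketched with a simplified form of timestamp ordering. This sketch assumes each transaction carries one integer timestamp and only distinguishes "done immediately" from "rejected"; the delayed case, which arises when uncommitted tentative versions are tracked, is deliberately left out. `TimestampedObject` is an illustrative name.

```python
class TimestampedObject:
    """Object that remembers the latest read and write timestamps."""

    def __init__(self, value=None):
        self.value = value
        self.read_ts = 0
        self.write_ts = 0

    def read(self, ts):
        """Return (accepted, value)."""
        if ts < self.write_ts:
            return False, None      # a younger transaction already wrote
        self.read_ts = max(self.read_ts, ts)
        return True, self.value

    def write(self, ts, value):
        """Return True if the write is accepted."""
        if ts < self.read_ts or ts < self.write_ts:
            return False            # a younger transaction already read or wrote
        self.value = value
        self.write_ts = ts
        return True


obj = TimestampedObject(10)
print(obj.write(5, 20))   # accepted
print(obj.read(3))        # rejected: transaction 3 is older than the last writer
print(obj.read(7))        # accepted
print(obj.write(6, 30))   # rejected: transaction 7 has already read the object
```

A rejected operation means the transaction is aborted and may be restarted with a new timestamp.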

Clusters

Benefits of computer clusters include

Scalable performance
High availability
Fault tolerance
Modular growth
Use of commodity components

Attributes of Computer Clusters

Scalability
Packaging
- Compact packaging: closely packaged in racks
- Slack packaging: Located in different locations
Control
- Centralized
- Decentralized
Homogeneity
- Homogeneous cluster: nodes from the same platform
- Heterogeneous cluster: nodes from different platforms

Architecture

The OS should be multiuser, multitasking and multithreaded
Nodes are interconnected by fast commodity networks
Cluster middleware glues together all node platforms at the user space

Design principles of Clusters

Single-System image (SSI)
The same client will see the same view of the service no matter which machine in the cluster it connects to.
Reliability
- operate without a breakdown
Availability
- percentage of time available to the user
Serviceability
- maintenance/repair/upgrades etc.

Operate-Repair cycle

Mean time to failure
- average time of normal operation before a failure
Mean time to repair
- average time to fix (restore) the system

Type of Failures

Unplanned failures vs. planned shutdowns
Transient failures vs. permanent failures
- a transient failure can be fixed by a reboot
Partial failures vs. total failures
- a partial failure affects only part of the system, and the cluster is still usable

Fault-Tolerant

Hot standby
Only primary nodes are actively doing the useful work
Standby nodes are powered on and running some monitoring programs
Active-takeover
All servers are primary and doing useful work.
Users may experience some delays or may lose some data when a server fails
Failover
When a component fails, it allows the remaining system to take over the services

Failure Cost Analysis

MTTF, MTTR
Availability (%)
The downtime per year (hours)
The yearly failure cost
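
The four quantities above can be computed from one another, using Availability = MTTF / (MTTF + MTTR). The sketch below assumes a 365-day year; the MTTF, MTTR and hourly cost figures are made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24


def availability(mttf_hours, mttr_hours):
    """Fraction of time the system is available."""
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail):
    return (1 - avail) * HOURS_PER_YEAR


def yearly_failure_cost(avail, cost_per_downtime_hour):
    return downtime_hours_per_year(avail) * cost_per_downtime_hour


# Assumed figures: MTTF 990 h, MTTR 10 h, 1000 currency units lost per hour.
a = availability(990, 10)
print(f"availability: {a:.2%}")
print(f"downtime per year: {downtime_hours_per_year(a):.1f} h")
print(f"yearly failure cost: {yearly_failure_cost(a, 1000):.0f}")
```

Shortening MTTR improves availability just as lengthening MTTF does, which is why fast failover matters.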

Posted 2021-01-29 | Updated 2025-06-19 | Notes | 3 minutes read (About 457 words)

Distributed and Cloud Computing Notes 1

Reasons for Distributed Systems

Functional Separation
- Different Capabilities and purposes
Inherent Distribution
- Information
- People
Power imbalance and load variation
Reliability
Economies

Consequences of Distributed Systems

Concurrency - Each computer is autonomous
- Carry out tasks independently
- Tasks coordinate their actions by exchanging messages
- System capacity can be increased by adding more resources
No global clock
Independent Failures

Motivation of Distributed Systems

To share resource and information

Trends in Distributed Systems

The emergence of pervasive networking technology
The emergence of mobile and ubiquitous computing
The increasing demand for multimedia services
The view of distributed systems as a utility

Maintenance of intranet

No risk if there is no connection to the Internet
Firewalls are used to limit services from/to an intranet
- Limit FTP/Remote Desktop etc.

Mobile computing: Performing computing tasks while the user is on the move, away from his/her usual environment

Eight forms of transparency

Access transparency
Location transparency
Concurrency transparency
Replication transparency
Failure transparency
Mobility transparency
Performance transparency
Scaling transparency

List of Challenges

Heterogeneity
Security
- Confidentiality
  - Protection against disclosure of information to unauthorized individuals
- Integrity
  - Protection against alteration or corruption
- Availability
  - Protection against interference targeting access to the resources
  - DDoS
- Authenticity or Non-repudiation
  - Proof of sending / receiving information
  - digital signature

Failure

Availability = MTTF / (MTTF + MTTR)

Mean time to failure (MTTF)
- The average time of normal operation before the system fails
Mean time to repair (MTTR)
- The average time it takes to repair the system and restore it to working condition

Single point failure

A failure of a single hardware or software component that causes the whole system to crash. The key approach to enhancing availability is to turn as many failures as possible into partial failures by removing single points of failure.

Checkpointing

The process of periodically saving the state of an executing program to stable storage, from which the system can recover after a failure.
- Each saved program state is called a checkpoint.
- Checkpointing can be realized by the operating system at kernel level, by a third-party library, or by the application itself.
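
A minimal sketch of application-level checkpointing, the last of the three options above. It assumes the program state fits in a small dictionary and uses a local file as the "stable storage"; the function names and the checkpoint interval are illustrative.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write to a temporary file, then rename, so a crash mid-write
    leaves the previous checkpoint intact."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)


def load_checkpoint(path, default):
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return default


def sum_with_checkpoints(numbers, path, every=2):
    """Sum a list, resuming from the last checkpoint if one exists."""
    state = load_checkpoint(path, {"next": 0, "total": 0})
    for i in range(state["next"], len(numbers)):
        state["total"] += numbers[i]
        state["next"] = i + 1
        if state["next"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        checkpoint = os.path.join(folder, "sum.ckpt")
        print(sum_with_checkpoints([1, 2, 3, 4, 5], checkpoint))
```

If the process is killed and started again with the same path, it resumes from the saved position instead of from the beginning.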

Jobs

Serial jobs: run on a single node
Parallel jobs: use multiple nodes
Interactive jobs: require fast turnaround time, and their input/output is directed to a terminal
Batch jobs: need more resources and don’t need immediate responses. Scheduled jobs.

Job Management System

A user server: lets users submit jobs
A job scheduler: performs job scheduling
A resource manager: allocates and monitors resources. Enforces scheduling policies, and collects accounting information.

Security Mechanisms

Encryption (AES, RSA)
Authentication (Password, Public key)
Authorization (access control)

Concurrency
- Fair scheduling
- Preserve dependencies
- Avoid deadlocks
- Object locking, data consistency, semaphores
Fault tolerance (No failure despite faults)
- Fault detection
  - Checksums
  - Heartbeat
- Fault masking
  - Retransmission of corrupted messages
  - Redundancy
- Fault toleration
  - Exception handling
  - Timeouts
- Fault recovery
  - Rollback mechanisms
Scalability
Openness
Distribution transparency <= hide the distribution from users and applications
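
The heartbeat item under fault detection in the list above can be sketched as a timeout-based monitor. This is a minimal sketch: `HeartbeatMonitor` is a hypothetical name, the timeout value is arbitrary, and a node that misses the timeout is only suspected, since a slow network looks the same as a crash.

```python
import time


class HeartbeatMonitor:
    """Suspect a node when no heartbeat has arrived within `timeout` seconds."""

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self.clock = clock
        self.last_seen = {}   # node name -> time of last heartbeat

    def beat(self, node):
        self.last_seen[node] = self.clock()

    def suspected(self):
        now = self.clock()
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


# Usage with a fake clock so the example does not need to sleep.
now = [0.0]
monitor = HeartbeatMonitor(timeout=5, clock=lambda: now[0])
monitor.beat("a")
monitor.beat("b")
now[0] = 3.0
monitor.beat("b")      # only b keeps sending heartbeats
now[0] = 6.0
print(monitor.suspected())
```

Passing the clock in as a parameter is a design choice made here to keep the detector testable without real delays.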