MIT 6.824 2022 notes 1


Lecture 1: Introduction

why do we use distributed systems?

  1. connect physically seperated machines
    • Usually shared
  2. increase capacity through parallelism
  3. tolerate faults
  4. achieve security/via isolate

What are the challenges?

  1. Concurrency
  2. Partial failure
  3. High performance

Infrastructure Abstractions

  • storage
  • communication
  • Computation

Implementation

RPC, threads, concurrency lock.

performace

Scalability

Fault Tolerance

  • Availability
  • Recoverbility

Solutions:

  • NV storage(non-volatile storage)非易失性存储
  • Replication: management of replication copies,

Consistency

puts and gets for k-v store

reasons for non-consistency

  • cache
  • Replication

kinds

  • Strong
    • expensive to implement
      • Communication
  • weak
    • Real world

Mapreduce

MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.

A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies). The “MapReduce System” (also called “infrastructure” or “framework”) orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.

——Wiki

Definitions

Map(k, v): k-the name of the file, v-the content of the maps inout file(full text).

Reduce(k, v): k-the word, v-value to be summaryed(count or frequency).

Master server: organize the whole computation

worker server: computation work

Working process

If we want to run a MapReduce job that takes the entire crawled web as input the data is already stored in a way that split up evenly across all the file servers(a web file) and so that means that the map workers are able to read the data in parallel from these file servers.

Does Google have one set of physical machines among GFS and a separate set of physical machines that run MapReduce jobs?

No. (just for 2004)

In general case if we have big files stored in some big Network file system like GFS. MapReduce worker process has to go off and talk across the network to the correct GFS server and maybe servers that store the data are part of the input and fetch servers spread data over the network to pass the map to the MapReduce worker machine.

In this case we need massive network throughput capacity.

If you just pick some MapReduce worker and some GFS servers, chances are at least half the time the communication between them has to pass through. This one wouldn’t switch their routes which had only some amount of total throughput.

Usually we run the GFS servers and the MapReduce workers on the same set of machines. Possibly we have a thousand machines running the service.

Now network is much faster and we can collect and process data from anywhere.

GFS splits a big file into 64M pieces and process them on single physical machines, finally the MapReduce framework would gather them up and write them into giant files in GFS via network.(needs massive network throughput)


文章作者: homie
版权声明: 本博客所有文章除特別声明外,均采用 CC BY 4.0 许可协议。转载请注明来源 homie !
  目录