Condor

(by University of Wisconsin-Madison)

High Throughput Distributed Computing System
Presented by Mark Silberstein, 2006

CDP 236370, Technion


Definitions

● Cluster [pool] – a group of interconnected computers (resources)
● Batch (job) – a self-contained piece of software for unattended execution
● Batch/queuing system – a system for automatic scheduling and invocation of jobs competing for a resource [multiple resources]
● High Performance System – optimized for low latency of every job
● High Throughput System – optimized to increase utilization of resources
  – Ex: printer queue


Batch system – take 1: multiple identical resources

[Diagram: job queue dispatching jobs to identical CPUs]

● Job Queue
● Invokes jobs and brings results back
● Job “babysitting”
  – Invoke only once
  – Job failures


Batch system – take 2: distributed heterogeneous resources

[Diagram: job queue dispatching jobs to heterogeneous CPUs]

● Job Queue
● Invokes jobs and brings results back
● Job “babysitting”
● Remote control
● Report resource characteristics (metadata)
● Job requirements
  – “I want a CPU of at least ...”


Batch system – take 3: distributed heterogeneous resources + multiple users

[Diagram: job queue with access control dispatching to heterogeneous CPUs]

● Job Queue – access control
● Invokes jobs and brings results back
● Job “babysitting”
● Remote control
● Job requirements
● Resource attributes (metadata)
● Security
● Resource sharing policies – QoS


Batch system – take 4: distributed heterogeneous resources + multiple users + non-dedicated resources (cycle stealing)

[Diagram: job queue with access control dispatching to non-dedicated CPUs]

● Job Queue + access control
● Invokes jobs and brings results back
● Job “babysitting”
● Remote control
● Job requirements
● Resource attributes (metadata) – periodic update
● Security
● Resource sharing policies – QoS
● On-demand job eviction
● Fault tolerance
● Respecting resource policies

Condor at a glance

[Diagram: submission hosts and execution hosts (CPUs) connected through the Matchmaker]

● Basic idea – “classified advertisement” matching
  – Resources publish their capabilities
  – Jobs publish their requirements
  – Matchmaker finds the best match


Condor architecture – Submission host: schedd and shadow

[Diagram: submission host running the schedd and one shadow per job, connected to the Matchmaker and the execution CPUs]

● Schedd – Job Queue
  – Holds a DB of all jobs submitted for execution (fault-tolerant)
  – Requests resources from the MM
  – Claiming logic
  – Ensures only-once semantics
● Shadow (per running job)
  – Remote invocation
  – Input/output staging
  – Job “babysitting” – failure identification
  – Sometimes works as an I/O proxy


Condor architecture – Execution host: startd and starter

[Diagram: execution host running the startd (execution gateway) and one starter per job, connected to the schedd, the shadow and the Matchmaker]

● Startd – resource manager
  – Monitoring
    ● Keeps track of resource usage
    ● Periodically sends resource attributes to the MM
  – Enforces local policies
  – Security
  – Spawns the starter
  – Communicates with the schedd
● Starter (per running job)
  – Communicates with the shadow (I/O)
  – Environment creation and cleanup
  – Controls job execution


Matchmaker

[Diagram: pool entities publish/subscribe classads to the Collector; the Negotiator pulls from the Collector and notifies matched parties]

● Collector
  – Central registry of the pool metadata
  – All pool entities send reports to the collector
● Negotiator – the Condor brain
  – Periodically pulls info from the collector
  – Attempts to match requests with resources
  – Notifies happy pairs
  – Maintains a fair share of resources between users


Condor description language: ClassAd

● Used to describe entities – resources, jobs, daemons, etc.
● Schema-less!!!
● Mapping of attribute names to expressions
● Both descriptive and functional
  – Expressions can contain attributes from other classads
  – Protocol for expression evaluation
● Simple examples:
  – Ex1: Simple
    [ CPU=200; RAM=30 ]
  – Ex2: Reference to local attributes
    [ MyCPU=200; RAM=20; Power=(RAM+MyCPU) ]
  – Ex3: Reference to another classad
    [ Type=job; Exec=test.exe; Requirements=other.RAM>200 ]
    [ Type=resource; RAM=300 ]


Matching constraints

● The matching process is symmetric:
  – Matched only if both the resource and the job requirement expressions are true

  [ Type=job; Exec=test.exe; Requirements=other.RAM>200 ]
  [ Type=resource; RAM=300; Requirements=(Exec==test.exe) ]
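Below is a minimal, illustrative sketch of this symmetric check in Python. Classads are modeled as plain dicts and Requirements as predicates over (my ad, other ad); this is a toy model of the idea, not Condor's actual ClassAd evaluation protocol.

# Toy model of symmetric matching: classads as dicts, Requirements as
# Python predicates over (my_ad, other_ad). Not Condor's real evaluator.

def symmetric_match(ad_a, ad_b):
    """Matched only if both Requirements expressions evaluate to True."""
    return ad_a["Requirements"](ad_a, ad_b) and ad_b["Requirements"](ad_b, ad_a)

job = {
    "Type": "job",
    "Exec": "test.exe",
    # job requirement: the other (resource) classad must advertise RAM > 200
    "Requirements": lambda my, other: other["RAM"] > 200,
}

resource = {
    "Type": "resource",
    "RAM": 300,
    # resource requirement: it only accepts jobs whose Exec is test.exe
    "Requirements": lambda my, other: other["Exec"] == "test.exe",
}

print(symmetric_match(job, resource))  # True – both requirement expressions hold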


Example of a resource classad

MyType = "Machine"
TargetType = "Job"
Name = "[email protected]"
Machine = "ds-i1.cs.technion.ac.il"
Rank = 0.000000
CpuBusy = ((LoadAvg - CondorLoadAvg) >= 0.500000)
CondorVersion = "$CondorVersion: 6.4.7 Jan 26 2003 $"
CondorPlatform = "$CondorPlatform: INTEL-LINUX-GLIBC22 $"
VirtualMemory = 1014294
Disk = 34126016
CondorLoadAvg = 0.000000
LoadAvg = 1.000000
KeyboardIdle = 26038
Arch = "INTEL"
OpSys = "LINUX"
UidDomain = "cs.technion.ac.il"
FileSystemDomain = "cs.technion.ac.il"
Subnet = "132.68.37"
HasIOProxy = TRUE
...
CpuBusyTime = 2109520
CpuIsBusy = TRUE
State = "Owner"
EnteredCurrentState = 1084352386
Activity = "Idle"
EnteredCurrentActivity = 1084352386
Start = (Scheduler =?= "[email protected]") || (((KeyboardIdle > 15 * 60) && (((LoadAvg - CondorLoadAvg) <= 0.300000 && ((CurrentTime - EnteredCurrentState) >= 60)) || (State != "Unclaimed" && State != "Owner"))))
Requirements = START


Matchmaker in detail

● Collector stores
  – All resources' classads
  – All schedds' classads
    ● Represent only the number of jobs and their owners, but not the jobs themselves
● Information is always outdated
● Periodically removes stale data
  – Soft registration


Idle state (periodic update and soft state)

[Diagram: schedd and startd periodically publish their classads to the Collector, which removes stale data (garbage collection)]

● Schedd classad: number of idle jobs in the queue, IP:port
● Startd classad: resource state, resource characteristics
● Collector: removes stale data (garbage collection)
● Important: this diagram is valid throughout the whole life of the schedd and the startd
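A small, self-contained sketch of the soft-state idea, assuming every daemon re-publishes its classad periodically and the collector drops ads that have not been refreshed recently; the timeout value and the sample classad contents are made up for illustration.

import time

# Illustrative soft-state registry: ads must be refreshed periodically,
# otherwise the collector garbage-collects them. The timeout is made up.
AD_TIMEOUT = 15 * 60  # seconds without an update before an ad is stale

class Collector:
    def __init__(self):
        self.ads = {}  # name -> (classad dict, last-update timestamp)

    def publish(self, name, classad):
        """Called periodically by every schedd/startd (soft registration)."""
        self.ads[name] = (classad, time.time())

    def garbage_collect(self):
        """Drop ads that were not refreshed in time (e.g. the daemon died)."""
        now = time.time()
        for name in [n for n, (_, ts) in self.ads.items() if now - ts > AD_TIMEOUT]:
            del self.ads[name]

collector = Collector()
collector.publish("schedd@submit-host", {"IdleJobs": 12, "Address": "1.2.3.4:9618"})
collector.publish("startd@exec-host", {"State": "Unclaimed", "Arch": "INTEL"})
collector.garbage_collect()  # nothing is stale yet, both ads survive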


Negotiator

● Periodic negotiation cycle
  1) Pulls all classads (once per cycle)
  2) Contacts each schedd according to priority and gets a job classad
  3) For each job, traverses all resources' classads and attempts to match each one
  4) If a match is found
     1) Chooses the best match according to global and local policies
     2) Notifies the matched parties
     3) Removes the matched classad
  5) If not found – tries the next job from the same schedd, or the next schedd
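A simplified, self-contained sketch of this cycle. Classads are plain dicts and the matching and ranking rules are toy stand-ins (minimum RAM, most RAM wins); this only mirrors the numbered steps above, not the negotiator's real logic.

# Simplified sketch of the negotiation cycle described above.
# Classads are dicts; matching and ranking rules are toy stand-ins.

def matches(job, resource):
    return resource["RAM"] >= job["MinRAM"]

def choose_best(job, candidates):
    # "best match according to policies" reduced to: most RAM wins
    return max(candidates, key=lambda r: r["RAM"])

def negotiation_cycle(schedds, resources):
    resources = list(resources)                       # 1) pulled once per cycle
    matched = []
    for schedd in schedds:                            # 2) schedds assumed pre-ordered by user priority
        for job in schedd["idle_jobs"]:               #    get the next job classad
            candidates = [r for r in resources if matches(job, r)]   # 3) try each resource
            if candidates:                            # 4) match found
                best = choose_best(job, candidates)   #    pick the best per policies
                matched.append((job, best))           #    "notify" both parties
                resources.remove(best)                #    remove the matched classad
            # 5) no match: fall through to the next job / next schedd
    return matched

schedds = [{"name": "schedd-A", "idle_jobs": [{"MinRAM": 200}, {"MinRAM": 900}]}]
resources = [{"Name": "cpu1", "RAM": 300}, {"Name": "cpu2", "RAM": 512}]
print(negotiation_cycle(schedds, resources))  # the 200 MB job matches cpu2, the 900 MB job does not match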


Claiming and running

[Sequence diagram: the Negotiator informs both the Schedd and the Startd of the match (startd IP, match/claim ID); the Schedd then talks to the Startd directly]

Schedd → Startd: Activate claim: are you available?
Startd → Schedd: Yes
Schedd → Startd: Run job
Startd ↔ Schedd: Alive / Alive (keepalives)
Startd → Schedd: Job finished, have more?
Schedd → Startd: Release claim: no, thanks


Negotiation state sequence diagram

[Sequence diagram; participants: Schedd, Negotiator, Collector, Startd]

Single negotiation cycle (repeated while there are idle jobs):
1. Negotiator → Collector: fetch all classads
2. Negotiator: get the next schedd with idle jobs
3. Negotiator → Schedd: get the next job classad
4. Schedd: chooses the next job to match, sends its job classad
5. Negotiator: performs matchmaking, assigns a Claim ID
6. Negotiator → Startd: “I am claimed” (Claim ID)
7. Negotiator → Schedd: Claim ID and address of the matched startd
8. Schedd → Startd: activate claim with the received Claim ID
9. Startd: validation of correct match
10. Startd: start new job ... job complete


Startd resource monitoring

● Periodic sampling of system resources
  – CPU utilization, hard disk, memory, ...
● User-defined attributes
● If a job is running – total running time, total load by the job, ...
● Published in the classad and can be matched against


Startd policies (support for cycle stealing)

● The resource owner can configure
  – When the resource is considered available
    ● Ex: only after the keyboard has been idle for 15 minutes
  – What to do when the owner is back
    ● Ex: suspend the job to RAM
  – How to evict the job
    ● Ex: the job should be killed at most 5 sec after “I want my resource back”
● The pool manager has no control over these policies
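A minimal sketch of how such owner policies could be evaluated locally, assuming the startd samples the machine state into a dict. The policy names, field names and thresholds mirror the examples on this slide and are illustrative only; this is not real Condor configuration syntax.

# Illustrative only: owner policies as predicates over the machine state
# the startd monitors. Names and thresholds echo the slide's examples;
# this is not actual Condor configuration.

owner_policy = {
    # resource is considered available only after 15 min of keyboard idleness
    "START":   lambda m: m["KeyboardIdle"] > 15 * 60,
    # when the owner is back (keyboard active again), suspend the job to RAM
    "SUSPEND": lambda m: m["KeyboardIdle"] < 60,
    # eviction: the job must be gone within ~5 s of the owner's reclaim request
    "KILL":    lambda m: m["OwnerWantsMachine"] and m["SecondsSinceEvictRequest"] > 5,
}

machine_state = {
    "KeyboardIdle": 20 * 60,          # owner away for 20 minutes
    "OwnerWantsMachine": False,
    "SecondsSinceEvictRequest": 0,
}

for name, expr in owner_policy.items():
    print(name, "=>", expr(machine_state))   # START => True, others => False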


Global resource sharing policies

● How should resources be shared between users?
● What happens without policies:
  – 1000 computers; user A starts 1000 jobs, each 5 hours long; user B will have to wait ;(((
● Solution – fair share
  – A user with a higher priority can preempt another user's job
  – Priorities change dynamically according to resource usage: more resources – worse priority

    Prio_user(t) = k * Prio_user(t - dt) + (1 - k) * (number of used resources),
    where k = 0.5^(dt / priority half-life)
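A small numeric sketch of this update rule; the half-life, update interval, initial priority and usage figures below are made up for illustration.

# Numeric sketch of the fair-share priority update from this slide:
#   prio(t) = k * prio(t - dt) + (1 - k) * used_resources,  k = 0.5 ** (dt / half_life)
# All concrete numbers below are made up.

def update_priority(prev_prio, used_resources, dt, half_life):
    k = 0.5 ** (dt / half_life)
    return k * prev_prio + (1 - k) * used_resources

prio = 0.5                # starting priority value
half_life = 3600.0        # one-hour priority half-life
for _ in range(6):        # the user keeps 100 machines busy for 6 x 20 minutes
    prio = update_priority(prio, used_resources=100, dt=1200.0, half_life=half_life)
    print(round(prio, 1))
# the value climbs toward 100: heavier usage => numerically worse priority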


Putting policies together: negotiation cycle revisited (Condor 6.6 series)

● Periodic negotiation cycle
  1) Pull all classads (once per cycle) and optimize for matching
  2) Order all schedd requests by user priorities // higher priority – served first
  3) For each user:
     While (the user's quota is not exceeded AND the user has more job requests) do   // NEW JOB
       1) Contact the schedd and get the next job classad
       2) Traverse all resources' classads and attempt to match them one by one
          1) If no match is found – notify the schedd; goto NEW JOB
          2) If a match is found – AssignWeights(), add it to the matched list
       3) ChooseBestMatch() and Notify()


Putting policies together: negotiation cycle revisited (cont.)

● Function AssignWeight()
  1) Assign a preemption weight:
     ● 2 – if the resource is idle
     ● 1 – if the resource is busy and prefers the new job over the current one (Resource Rank evaluation)
     ● 0 – if the resource is busy, the new job's user has a better priority than the current one, and the global policy permits preemption
  2) Evaluate job preferences (Job Rank evaluation)
● Function ChooseBestMatch(): lexicographic order
  – Sort according to job rank, pick the best one
  – Among all matches with the equal best rank – sort according to preemption weight
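A compact sketch of these two functions, again with classads modeled as dicts; the rank functions, state names and the toy usage example are illustrative, not the negotiator's real data structures.

# Sketch of AssignWeight() / ChooseBestMatch() from this slide.
# Classads are dicts; rank functions and state names are illustrative.

def assign_weight(job, resource, new_user_has_better_prio, preemption_allowed):
    """Preemption weight: 2 = idle, 1 = resource prefers the new job, 0 = priority preemption."""
    if resource["State"] == "Unclaimed":
        return 2
    if resource["Rank"](job) > resource["Rank"](resource["CurrentJob"]):
        return 1
    if new_user_has_better_prio and preemption_allowed:
        return 0
    return None  # a busy resource that may not be preempted: not a legal match

def choose_best_match(job, weighted_matches):
    """Lexicographic order: highest job rank first, preemption weight breaks ties."""
    return max(weighted_matches, key=lambda rw: (job["Rank"](rw[0]), rw[1]))

# tiny usage example with two candidate resources
job = {"Rank": lambda r: r["Mips"]}                  # the job prefers faster machines
idle = {"Name": "idle-box", "State": "Unclaimed", "Mips": 1000}
busy = {"Name": "busy-box", "State": "Claimed", "Mips": 1000,
        "Rank": lambda j: 0, "CurrentJob": {}}

matches = []
for r in (idle, busy):
    w = assign_weight(job, r, new_user_has_better_prio=True, preemption_allowed=True)
    if w is not None:
        matches.append((r, w))
print(choose_best_match(job, matches)[0]["Name"])    # idle-box: equal job rank, higher weight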

Condor and MPI parallel jobs

● Problem: MPI requires synchronous invocation and execution of multiple instances of a program.
● Why this is a problem:
  – The Negotiator matches only one job at a time
  – The Schedd knows how to invoke only one job at a time
  – Different failure semantics: a single instance failure IS a whole MPI job failure
  – The Startd might preempt a single job, but this would kill the whole MPI run


MPI Universe

● Each Startd capable of running MPI jobs publishes the attribute “DedicatedScheduler=”
● Each MPI sub-job has a requirement to run on a host with DedicatedScheduler defined
● The Negotiator matches all such hosts and passes them to the Schedd
● The dedicated Schedd is responsible for synchronous invocation and for the failure semantics
● The dedicated Schedd can preempt any job on such a host


Condor in the Technion

● Condor is deployed in DSL, SSDL and CBL (total ~200 CPUs)
● Gozal: R&D projects for Condor enhancements. Among them:
  – High availability
  – Distributed management and configuration
  – Resource sandbox
  – On the web: http://dsl.cs.technion.ac.il/projects/gozal/
● Superlink-online: genetic linkage analysis portal


References

● www.condorproject.org
  – Condor administration manual
  – Research papers
● Slides from the previous year's lecture

