Operating Systems Chapter 10: File systems
2017. Spring Instructor: Joonho Kwon
[email protected] Data Science Laboratory, PNU
These slides are based on the slides prepared by Anthony Joseph.
How do we Hide I/O Latency?
Blocking Interface: "Wait"
When requesting data (e.g., read() system call), put process to sleep until data is ready
When writing data (e.g., write() system call), put process to sleep until device is ready for data
Non-blocking Interface: "Don't Wait"
Returns quickly from read or write request with count of bytes successfully transferred to kernel
Read may return nothing, write may write nothing
Asynchronous Interface: "Tell Me Later"
When requesting data, take pointer to user's buffer, return immediately; later kernel fills buffer and notifies user
When sending data, take pointer to user's buffer, return immediately; later kernel takes data and notifies user
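The "Don't Wait" behavior can be observed on a POSIX system with a short sketch. It assumes an empty pipe with O_NONBLOCK set on its read end; the function name read_would_block is made up for this illustration. On POSIX, "read returns nothing" shows up as a return value of -1 with errno set to EAGAIN (or EWOULDBLOCK).

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if read() on an empty non-blocking pipe comes back at once
 * with EAGAIN/EWOULDBLOCK, 0 if it behaved differently, -1 on setup error. */
int read_would_block(void)
{
    int fds[2];
    char byte;

    if (pipe(fds) == -1)
        return -1;

    /* Switch the read end of the pipe to non-blocking mode. */
    int flags = fcntl(fds[0], F_GETFL);
    if (flags == -1 || fcntl(fds[0], F_SETFL, flags | O_NONBLOCK) == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    assert(fcntl(fds[0], F_GETFL) & O_NONBLOCK);   /* the flag is now set */

    /* Nothing has been written, so a blocking read would sleep here. */
    ssize_t n = read(fds[0], &byte, 1);
    int would_block = (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK));

    close(fds[0]);
    close(fds[1]);
    return would_block;
}
```

Without the F_SETFL call, the same read() would put the calling process to sleep until a writer supplied data, which is the blocking case.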
I/O & Storage Layers
Operations, Entities and Interface (top to bottom):
Application / Service
High Level I/O: streams
Low Level I/O: handles
Syscall: registers
File System: descriptors (file_open, file_read, … on struct file * & void *) <= we are here …
I/O Driver: Commands and Data Transfers
Disks, Flash, Controllers, DMA
Recall: C Low level I/O
Operations on File Descriptors – as OS object representing the state of a file
User has a “handle” on the descriptor
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
int open (const char *filename, int flags [, mode_t mode])
int creat (const char *filename, mode_t mode)
int close (int filedes)
flags: bit vector of:
• Access modes (Rd, Wr, …)
• Open Flags (Create, …)
• Operating modes (Appends, …)
mode: bit vector of permission bits:
• User|Group|Other X R|W|X
http://www.gnu.org/software/libc/manual/html_node/Opening-and-Closing-Files.html
Recall: C Low Level Operations
ssize_t read (int filedes, void *buffer, size_t maxsize)
- returns bytes read, 0 => EOF, -1 => error
ssize_t write (int filedes, const void *buffer, size_t size)
- returns bytes written
off_t lseek (int filedes, off_t offset, int whence)
- set the file offset
* if whence == SEEK_SET: set file offset to "offset"
* if whence == SEEK_CUR: set file offset to current location + "offset"
* if whence == SEEK_END: set file offset to file size + "offset"
int fsync (int filedes)
- wait for I/O of filedes to finish and commit to disk
void sync (void)
- wait for ALL I/O to finish and commit to disk
When write returns, data is on its way to disk and can be read, but it may not actually be permanent!
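A minimal sketch of these calls working together, assuming a POSIX system with a writable /tmp. The function name write_sync_reread and the temp-file template are hypothetical. It writes 11 bytes, forces them to disk with fsync(), then uses SEEK_END with a negative offset to re-read the last 5 bytes.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Writes "hello world" to a temporary file, commits it with fsync(),
 * then seeks to (file size - 5) and reads the tail into out.
 * Returns 0 on success, -1 on any error. */
int write_sync_reread(char out[6])
{
    assert(out != NULL);
    char path[] = "/tmp/lowio_XXXXXX";   /* hypothetical temp-file template */
    int fd = mkstemp(path);              /* creates and opens the file read/write */
    if (fd == -1)
        return -1;
    unlink(path);                        /* file disappears once fd is closed */

    const char msg[] = "hello world";
    ssize_t len = (ssize_t)strlen(msg);

    int ok = write(fd, msg, (size_t)len) == len      /* data handed to the kernel */
          && fsync(fd) == 0                          /* wait until it is on disk */
          && lseek(fd, -5, SEEK_END) == len - 5      /* file size + (-5) */
          && read(fd, out, 5) == 5;
    if (ok)
        out[5] = '\0';

    close(fd);
    return ok ? 0 : -1;
}
```

The read() after write() would succeed even without fsync(); the fsync() call only matters for durability across a crash.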
Building a File System
File System: Layer of OS that transforms block interface of disks (or other block devices) into Files, Directories, etc.
File System Components
Naming: Interface to find files by name, not by blocks
Disk Management: Collecting disk blocks into files
Protection: Layers to keep data secure
Reliability/Durability: Keeping files durable despite crashes, media failures, attacks, etc.
Recall: User vs. System View of a File
User’s view: Durable Data Structures
System’s view (system call interface): Collection of Bytes (UNIX) Doesn’t matter to system what kind of data structures you want to store on disk!
System's view (inside OS): Collection of blocks (a block is a logical transfer unit, while a sector is the physical transfer unit)
Block size >= sector size; in UNIX, block size is 4KB
Recall: Translating from User to System View
What happens if user says: give me bytes 2-12?
Fetch block corresponding to those bytes
Return just the correct portion of the block
What about: write bytes 2-12?
Fetch block
Modify portion
Write out block
Everything inside File System is in whole size blocks
For example, getc(), putc() => buffers something like 4096 bytes, even if interface is one byte at a time
From now on, file is a collection of blocks
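The byte-to-block translation is plain integer arithmetic. The sketch below assumes a 4096-byte block; the struct and function names are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 4096u   /* assumed block size (4KB) */

struct block_span {
    size_t first_block;    /* index of the block holding the first byte */
    size_t last_block;     /* index of the block holding the last byte */
    size_t first_offset;   /* position of the first byte inside first_block */
};

/* Maps the inclusive byte range [first_byte, last_byte] to whole blocks. */
struct block_span bytes_to_blocks(size_t first_byte, size_t last_byte)
{
    assert(first_byte <= last_byte);
    struct block_span s;
    s.first_block  = first_byte / BLOCK_SIZE;
    s.last_block   = last_byte / BLOCK_SIZE;
    s.first_offset = first_byte % BLOCK_SIZE;
    return s;
}
```

For bytes 2-12 this yields block 0 only, starting at offset 2, so a read fetches one block and returns 11 bytes of it. A range such as 4090-4100 straddles blocks 0 and 1, so both must be fetched.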
Disk Management Policies (1/2)
Basic entities on a disk:
File: user-visible group of blocks arranged sequentially in logical space Directory: user-visible index mapping names to files
Access disk as linear array of sectors. Two Options:
Identify sectors as vectors [cylinder, surface, sector], sorted in cylinder-major order; not used anymore
Logical Block Addressing (LBA): every sector has an integer address from zero up to the max number of sectors
Controller translates from address => physical position
First case: OS/BIOS must deal with bad sectors
Second case: hardware shields OS from structure of disk
Recall: Disk Management Policies (2/2)
Need way to track free disk blocks
Link free blocks together => too slow today
Use bitmap to represent free space on disk
Need way to structure files: File Header
Track which blocks belong at which offsets within the logical file structure Optimize placement of files’ disk blocks to match access and usage patterns
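The bitmap idea for free-space tracking can be sketched as one bit per disk block. The toy disk size and the function names alloc_block and free_block are assumptions made for this example; a real file system would also persist the bitmap on disk.

```c
#include <assert.h>
#include <stdint.h>

#define NBLOCKS 64u                      /* toy disk size, an assumption */

static uint8_t free_map[NBLOCKS / 8];    /* bit = 1 means the block is in use */

/* Finds the first free block, marks it used, and returns its number;
 * returns -1 when every block is in use. */
int alloc_block(void)
{
    for (unsigned b = 0; b < NBLOCKS; b++) {
        if (!(free_map[b / 8] & (1u << (b % 8)))) {
            free_map[b / 8] |= (uint8_t)(1u << (b % 8));
            return (int)b;
        }
    }
    return -1;
}

/* Marks block b as free again. */
void free_block(unsigned b)
{
    assert(b < NBLOCKS);
    free_map[b / 8] &= (uint8_t)~(1u << (b % 8));
}
```

Because allocation scans from the lowest block number, blocks handed out back to back tend to be adjacent on disk, which helps sequential access.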
Designing a File System …
What factors are critical to the design choices?
Durable data store => it's all on disk
(Hard) Disks Performance !!!
Maximize sequential access, minimize seeks
Open before Read/Write
Can perform protection checks and look up where the actual file resources are, in advance
Size is determined as they are used !!!
Can write (or read zeros) to expand the file
Start small and grow, need to make room
Organized into directories
What data structure (on disk) for that?
Need to allocate / free blocks
Such that access remains efficient
File system
Data Structure Functions
These slides were prepared based on materials by Professor 고건 (Kern Koh) of Seoul National University, OLC Center.
Terminology
[Figure: Memory, OS, and Disk; labels: Sector, block, buf, Page]
sector
smallest addressable unit in hardware
block
multiple of disk sector (512B on x86)
kernel performs all I/O in terms of blocks
buffer
place in main memory where block is stored
page
multiple of block (4KB on x86)
segment
a chunk of a buffer, contiguous in memory
its size is smaller than block
Kernel Data Structure for File
[Figure: Process 1 and Process 2, each with its own PCB, plus CPU, mem, FCB, and File. Legend: table (data structure) vs. object (hardware or software)]
Meta-data for a File
Information kernel needs for a file:
• owner (e.g., Clinton)
• protection (e.g., rwx r-- r--)
• device (e.g., disk)
• content (e.g., sector address)
• device driver routines (e.g., read(), open())
• accessing where now (e.g., offset)
• ….
contiguous allocation
[Figure: files laid out contiguously, with holes between them]
Poor Insert/Delete; Free Space Management; Compaction; Wasted Space
scattered allocation
Contents of file FA may be stored on disk non-contiguously, in units of disk sectors
Why not contiguous allocation?
(O) fast, if R/W whole content sequentially; used for swap, device copy, …
(X) space management: many small holes (useless), i.e., external fragmentation
[Figure: file content scattered across disk sectors]
Kernel maintains metadata for each file
[Figure: file metadata next to the file's content blocks on disk]
File metadata includes pointers to data sectors
Open() retrieves metadata from disk to main memory
But not contents: they are too big !!
This metadata has pointers to data sectors
Now, you can reach any data sector through in-core metadata
"offset" = byte position to R/W next
Processes Sharing a File
Three processes (PA, PB, PC) share the same file FX: three copies of file metadata in memory
If PB updates metadata: broadcast to all copies
- risky (inconsistency)
- inefficient
Copying file metadata is expensive
Keep copies to a minimum; share a single copy whenever possible
What about file offset? Cannot be shared; private copy per process
[Figure: PA, PB, PC each pointing to FX metadata in memory; FX metadata and file content on disk]
Split Metadata for file
Systemwide information => "inode" table
All processes share a single copy in memory (these fields are not updated frequently)
- owner
- protection information
- device
- pointer to file content
- device driver routines
Private information => file table
Let each process have a private copy, since processes access different parts
- offset
Two data structures for each file
(system) file table: process-private info (per-process data); offset = next byte position to r/w
inode table: all the rest of the info; shared info, single copy globally; less frequently changed information
struct inode
struct inode {
    char  i_flag;
    char  i_count;     /* reference count */
    int   i_dev;       /* device where inode resides */
    int   i_number;    /* i number, 1-to-1 with device address */
    int   i_mode;
    char  i_nlink;     /* directory entries */
    char  i_uid;       /* owner */
    char  i_gid;       /* group of owner */
    char  i_size0;     /* most significant of size */
    char *i_size1;     /* least sig */
    int   i_addr[8];   /* sector addresses constituting file */
    int   i_lastr;     /* last logical block read (for read-ahead) */
} inode[NINODE];
struct file
/*
 * One file structure is allocated for each open/creat/pipe call.
 * Main use is to hold the read/write offset
 */
struct file {
    char  f_flag;        /* flags */
    char  f_count;       /* reference count */
    int   f_inode;       /* pointer to inode structure */
    char *f_offset[2];   /* read/write character pointer */
} file[NFILE];

/* flags */
#define FREAD  01
#define FWRITE 02
#define FPIPE  04
[Figure: file holds the offset; inode holds all the rest of the info]
Disk Space for ... Space for inode
Space for data blocks
inode inode inode inode
data block data block data blockdata block
File data size: varies
inode size: fixed
Space for inode in Disk (Each inode - fixed size)
inode 0
inode 1
inode 2
i-number: ordinal number of inode on disk
If I know (disk name, i-number), I can access file content:
(disk name, i-number) => inode => content
Sharing files
Example
(Case-1) who, grep -- pipe file
% who | grep
share inode (pipe file), do not share offset
[Figure: who => pipe (sequence of bytes) => grep]
(Case-2) parent/child -- tty file
% vi
share inode (tty file), share offset
[Figure: sh and vi both using tty(in)]
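The difference between sharing only the inode and sharing the offset too can be observed from user space. In this hedged sketch, a second open() of the same file stands in for the "share inode, not offset" case, and dup() stands in for the parent/child case, since a duplicated descriptor refers to the same file table entry just as an inherited one does. The function names and temp-file template are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Reads one byte through fd_a, then reports fd_b's current offset. */
static off_t offset_after_peer_read(int fd_a, int fd_b)
{
    char c;
    if (read(fd_a, &c, 1) != 1)
        return (off_t)-1;
    return lseek(fd_b, 0, SEEK_CUR);   /* query only, does not move fd_b */
}

/* results[0]: offset seen through a second open() of the same file.
 * results[1]: offset seen through a dup() of the first descriptor.
 * Returns 0 on success, -1 on error. */
int compare_offset_sharing(off_t results[2])
{
    assert(results != NULL);
    char path[] = "/tmp/share_XXXXXX";  /* hypothetical temp-file template */
    int fd1 = mkstemp(path);
    if (fd1 == -1)
        return -1;

    int fd2 = open(path, O_RDONLY);     /* new file table entry, same inode */
    int fd3 = dup(fd1);                 /* same file table entry as fd1 */
    unlink(path);

    int ok = fd2 != -1 && fd3 != -1
          && write(fd1, "abcdef", 6) == 6
          && lseek(fd1, 0, SEEK_SET) == 0;
    if (ok) {
        results[0] = offset_after_peer_read(fd1, fd2);  /* fd2 has its own offset */
        results[1] = offset_after_peer_read(fd1, fd3);  /* fd3 moves with fd1 */
    }

    if (fd2 != -1) close(fd2);
    if (fd3 != -1) close(fd3);
    close(fd1);
    return ok ? 0 : -1;
}
```

After two one-byte reads through fd1, the separately opened descriptor is still at offset 0, while the duplicated one has advanced to 2.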
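Sharing files
[Figure: (system) file table and inode table. For the pipeline (who, grep): each process has its own offset entry, and both entries point to the inode of the pipe file. For $ vi: sh and vi (same process group) use one shared offset entry, which points to the inode of the tty device. A separate game process has its own offset entry and the inode of the game file.]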
Device switch table
2-dim array which maps
(device name, operation name) => device driver routine
devswtab[]: open close read write ioctl
Read_lp
Starting address of read_lp() routine
device independence (above: file, below: device)
struct cdevsw {
    int (*d_open)();
    int (*d_close)();
    int (*d_read)();
    int (*d_write)();
    int (*d_sgtty)();
} cdevsw[];
Actually, a one-dimensional array of struct, not a two-dimensional array
[Figure: cdevsw[] entries for device 1, device 2, device 3, each holding (*d_open)(), (*d_close)(), (*d_read)(), (*d_write)(), (*d_sgtty)() for the operations open, close, read, write, ioctl; one entry's read slot points to Read_lp]
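A device switch table can be sketched in modern C as an array of structs of function pointers, indexed by device number. The two drivers below (a line printer and a tty), their return values, and the names devsw, devswtab entries, and dev_read are all invented for illustration; only the table-of-routines idea comes from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* One row per device: the driver routines for that device. */
struct devsw {
    int (*d_open)(void);
    int (*d_read)(char *buf, int n);
};

/* Hypothetical drivers; behavior is made up for the demo. */
static int open_lp(void)  { return 0; }
static int read_lp(char *buf, int n) { (void)buf; (void)n; return -1; } /* printer: no input */
static int open_tty(void) { return 0; }
static int read_tty(char *buf, int n)
{
    for (int i = 0; i < n; i++)
        buf[i] = 'x';                    /* pretend the user typed n characters */
    return n;
}

enum { DEV_LP, DEV_TTY, NDEV };

static const struct devsw devswtab[NDEV] = {
    [DEV_LP]  = { open_lp,  read_lp  },
    [DEV_TTY] = { open_tty, read_tty },
};

/* Device-independent read: select the routine by device number. */
int dev_read(int dev, char *buf, int n)
{
    assert(dev >= 0 && dev < NDEV);
    return devswtab[dev].d_read(buf, n);
}
```

Code above the table calls dev_read() without knowing which driver runs; adding a device means adding one row.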
Kernel tables after open(/a/b) (1/3)
[Figure: process PA's user structure, the (system) file table, and the inode table; in-core inodes for /, a, and b (with device name); b's data blocks on disk]
Kernel tables after open(/a/b) (2/3)
[Figure: as above, plus a file table entry holding the offset and pointing to b's inode]
Kernel tables after open(/a/b) (3/3)
[Figure: as above, plus u_ofile[] in user (slots 0 1 2 3 4); fd = 4 points to the file table entry]
File descriptor table (or open file table)
An array in struct user (u_ofile[] array)
per-process open file information
used whenever program calls open(), creat():
fd = open("/a/b", …)
(1) pathname of the file to open
(2) kernel opens file
(3) file descriptor is returned
fd is an integer ("file descriptor"), starts from 0, 1, 2, ..
0, 1, 2 reserved for standard (input/output/error) files
fd is used as an index into u_ofile[] array (file descriptor table, open file table)
starting point to access file (points to system file table)
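The chain from fd to inode can be modeled with three small tables. This is a toy sketch, not kernel code: the struct names, table sizes, and the toy_open and fd_to_inode helpers are assumptions, and the inode is taken as already loaded in memory.

```c
#include <assert.h>
#include <stddef.h>

#define NOFILE 8     /* descriptor slots per process, an assumed size */
#define NFILE  16    /* system file table entries, an assumed size */

struct toy_inode { int i_number; int i_count; };
struct toy_file  { int f_count; long f_offset; struct toy_inode *f_inode; };
struct toy_user  { struct toy_file *u_ofile[NOFILE]; };

static struct toy_file file_table[NFILE];

/* Takes a free file table entry and the lowest free u_ofile[] slot,
 * links them to inode ip, and returns the slot number as the fd. */
int toy_open(struct toy_user *u, struct toy_inode *ip)
{
    for (int f = 0; f < NFILE; f++) {
        if (file_table[f].f_count != 0)
            continue;                      /* entry already in use */
        for (int fd = 0; fd < NOFILE; fd++) {
            if (u->u_ofile[fd] != NULL)
                continue;                  /* descriptor already in use */
            file_table[f].f_count  = 1;
            file_table[f].f_offset = 0;    /* new open starts at offset zero */
            file_table[f].f_inode  = ip;
            ip->i_count++;
            u->u_ofile[fd] = &file_table[f];
            return fd;
        }
        return -1;                         /* no free descriptor slot */
    }
    return -1;                             /* system file table is full */
}

/* fd => u_ofile[] slot => file table entry => inode. */
struct toy_inode *fd_to_inode(struct toy_user *u, int fd)
{
    assert(fd >= 0 && fd < NOFILE && u->u_ofile[fd] != NULL);
    return u->u_ofile[fd]->f_inode;
}
```

Opening a tty inode three times fills slots 0, 1, 2, mirroring the reserved standard descriptors; the next open returns 3.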
Kernel data structure for file
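[Figure: PA's user structure (per process) holds u_ofile[], indexed by fd (0 1 2 3 4); each slot points to a (system) file table entry (offset), which points to an inode table entry (inode), which leads through devswtab to the device routine]
u_ofile[]: open file table, file descriptor table ("file handle" extends this notion to the network; also the Windows name)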
Kernel Data Structure
[Figure: Process 1 (user) => offset (file table) => inode => devswtab => read( ) => file FX on disk]
(System) file table
One entry for each open/creat/pipe
may be shared (if offset is shared)
content:
- offset
- counter (number of processes sharing this entry)
- pointer to inode table
- r/w/p flag
Inode table
Changed less frequently (than offset)
includes most of the information
content (while in disk):
- protection mode
- owner
- size
- time
- array of pointers to disk data blocks
In core Inode
content (while in disk):
- protection mode
- owner
- size
- time
- array of pointers to disk blocks
plus (at load time):
- counter (number of processes sharing)
- device name (major/minor device number)
- i-number (location of inode in disk)
- status (locked, mount point, …)
pointer array within inode
Now, you can reach any data block through in-core inode
These pointers are stored in an array within inode
[Figure: inode pointing to file content blocks]
pointer array within inode
[Figure: the array holds sector addresses of data blocks: direct 0 to direct 9, single indirect, double indirect, triple indirect]
Fast for small files, slower for big files -- timesharing application
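Which pointer serves a given logical block follows from simple range arithmetic. The sketch below assumes 10 direct pointers (direct 0 to direct 9) and takes the number of addresses per indirect block as a parameter; the enum and function names are invented for this example.

```c
#include <assert.h>

#define NDIRECT 10   /* direct 0 .. direct 9 */

enum ptr_level { DIRECT, SINGLE_INDIRECT, DOUBLE_INDIRECT, TRIPLE_INDIRECT, TOO_BIG };

/* Classifies logical block number lbn, given how many block addresses
 * fit in one indirect block (ptrs_per_block). */
enum ptr_level level_for_block(unsigned long long lbn,
                               unsigned long long ptrs_per_block)
{
    assert(ptrs_per_block > 0);
    unsigned long long span  = 1;         /* blocks reachable via one pointer at this level */
    unsigned long long limit = NDIRECT;   /* first lbn not covered so far */

    if (lbn < limit)
        return DIRECT;
    for (int level = SINGLE_INDIRECT; level <= TRIPLE_INDIRECT; level++) {
        span  *= ptrs_per_block;
        limit += span;
        if (lbn < limit)
            return (enum ptr_level)level;
    }
    return TOO_BIG;
}
```

With, say, 256 addresses per indirect block, blocks 0-9 need no extra disk read, blocks 10-265 need one indirect block read, and later blocks need two or three. That is why small files are fast and big files are slower.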
Offset vs Disk Block
[Figure: file offsets mapped onto the pointer array (direct 0 to direct 9, single indirect, double indirect, triple indirect); offset labels on the figure: ~1 KB, ~2 KB, ~9 KB, ~109 KB, ~10109 KB; each pointer holds the sector address of a data block (e.g., 57821)]
Linux
1st-12th pointers: direct pointers
13th pointer: indirect pointer
14th pointer: doubly indirect pointer
15th pointer: triply indirect pointer
Max 4096 GB file data if block address is 32 bits and block size is 4096 bytes
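The size limit can be checked with a few lines of arithmetic. The sketch assumes 4-byte (32-bit) block addresses, so one 4096-byte indirect block holds 1024 addresses; the function name max_file_blocks is made up here.

```c
#include <assert.h>
#include <stdint.h>

/* Number of data blocks reachable from ndirect direct pointers plus one
 * single, one double, and one triple indirect pointer. */
uint64_t max_file_blocks(uint64_t ndirect, uint64_t block_size, uint64_t addr_size)
{
    assert(addr_size > 0 && block_size >= addr_size);
    uint64_t p = block_size / addr_size;   /* addresses per indirect block */
    return ndirect + p + p * p + p * p * p;
}
```

For 12 direct pointers this gives 12 + 1024 + 1024^2 + 1024^3 = 1,074,791,436 blocks. The triple indirect term alone contributes 1024^3 x 4096 bytes = 2^42 bytes = 4096 GB, which is the figure quoted above; the other three terms add only about 4 GB more.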
Directory file
it is also a file.
content:
File name    i-number
"a"          7
"b"          1
"bin"        3
"dev"        772
i-number = 3 => 3rd inode on disk => data blocks
Q: file name: limit on number of characters?
Q: # of files: limit?
Kernel tables before open(/a/b) (1/3)
[Figure: PA's user structure, the (system) file table, and the inode table holding the inode of /; inodes and data blocks of /, a, and b are on disk]
Kernel tables before open(/a/b) (2/3)
[Figure: as above; the data block of directory / holds the entries a 7, bin 11, x 8]
Kernel tables before open(/a/b) (3/3)
[Figure: as above; the inode of a is loaded; the data block of directory a holds the entries b 3, usr 21, y 6]
open("/a/b") (1/2)
[Figure: three inodes (i), each pointing to its data blocks]
/ (directory): a 7, x 6, y 8, bin 11, dev 40
/a (directory): w 7, u 6, b 8, ch 11, temp 40
/a/b (regular file): content of this file
open("/a/b") (2/2)
/: inode 0 => directory data: a 7, bin 11, …, dev 40
a: inode of "a" => directory data: b 8, u 6, ch 11, …
b: inode of "b" => file b's data blocks
open(“/a/b”, …) (1/2)
Kernel system call open( ) scans pathname
1st -- root directory file:
get inode 0 in disk inode space
read data blocks of root directory file
search for file name "a"
get corresponding i-number for file "a"
2nd -- "a" file:
get inode of "a" from disk (also directory file)
get data blocks of directory file "a"
search for file name "b"
get corresponding i-number for file "b"
[Figure: pathname /a/b scanned against the root directory entries a 7, bin 11, dev 40]
open(“/a/b”, …) (2/2)
Kernel system call open( ) scans pathname (cont'd)
file "b":
read inode of "b" from disk (regular file)
---- pathname ends here ----
set up kernel data structures for file "b":
insert inode into in-core inode table
new entry in system file table (offset set to zero)
new entry in u_ofile[] in user
return file descriptor
open( ) is done
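The pathname scan can be sketched as a loop over components, looking each one up in the current directory. The toy "disk" below hard-codes two directories using the entries from the open("/a/b") (2/2) figure (root: a 7, bin 11, dev 40; a: b 8, u 6, ch 11); the names dirent_toy, dir_contents, and toy_namei are invented for this illustration, and no real disk I/O takes place.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct dirent_toy { const char *name; int inum; };

/* Toy directory contents, terminated by a NULL name. */
static const struct dirent_toy root_dir[] = { {"a", 7}, {"bin", 11}, {"dev", 40}, {NULL, 0} };
static const struct dirent_toy a_dir[]    = { {"b", 8}, {"u", 6}, {"ch", 11}, {NULL, 0} };

#define ROOT_INUM 0

/* Returns the entries of the directory with i-number inum, or NULL if
 * that inode is not a directory on this toy disk. */
static const struct dirent_toy *dir_contents(int inum)
{
    if (inum == ROOT_INUM) return root_dir;
    if (inum == 7)         return a_dir;
    return NULL;
}

/* Resolves an absolute pathname to an i-number, one component at a time;
 * returns -1 if a component is missing or a non-directory is searched. */
int toy_namei(const char *path)
{
    assert(path != NULL && path[0] == '/');
    int inum = ROOT_INUM;
    const char *p = path + 1;

    while (*p != '\0') {
        size_t len = strcspn(p, "/");              /* length of next component */
        const struct dirent_toy *d = dir_contents(inum);
        if (d == NULL)
            return -1;
        int found = -1;
        for (; d->name != NULL; d++) {
            if (strlen(d->name) == len && strncmp(d->name, p, len) == 0) {
                found = d->inum;
                break;
            }
        }
        if (found == -1)
            return -1;
        inum = found;
        p += len;
        if (*p == '/')
            p++;
    }
    return inum;
}
```

In a real kernel each dir_contents() step costs disk reads (inode, then directory data blocks), which is why the result of the walk is cached behind a file descriptor.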
Kernel tables after open(/a/b) (1/4)
[Figure: process PA's user structure, the (system) file table, and the inode table; in-core inodes for /, a, and b (with device name); b's data blocks on disk]
Kernel tables after open(/a/b) (2/4)
[Figure: as above, plus a file table entry holding the offset and pointing to b's inode]
Kernel tables after open(/a/b) (3/4)
[Figure: as above, plus u_ofile[] in user (slots 0 1 2 3 4); fd = 4 points to the file table entry]
Kernel tables after open(/a/b) (4/4)
[Figure: as above; fd = 4 is returned]
Once you have fd, you can access b's inode after only 3 memory accesses
open(“/a/b”, …) again
open() is very expensive -- disk accesses
once (open or creat) is enough
translate (pathname => fd) once, save it
do not use pathname in subsequent calls
read( ), write( ), close( ), …
use file descriptor instead
read(fd, ... ), write(fd, ... ), …
C functions for file
Wait a minute …
I used printf(), scanf(), getchar() …. But never used read(), write() before …?
I used FILE * …. But never used fd (file descriptor) before …?
Right, most people use library functions
And the library then invokes system calls
Remember? Library cannot perform I/O directly ….
library functions are in my address space (user)
System calls for files
creat()
open(), close()
read(), write()
lseek()    move offset
stat()     get inode content
All others are library functions
e.g., scanf(), gets(), getchar(), …..
System call vs. Library call
system call (in kernel): read(), takes fd
library call (in a.out, user):
             format      char        string     any number
tty files    scanf()     getchar()   gets()
all files    fscanf()    fgetc()     fgets()    fread()
library calls on all files take FILE * (struct in lib)
FILE vs fd
[Figure: user a.out contains my code (main( ), add( ), sub( )) and library code (fopen( ), printf( )); struct FILE in the library holds a local buffer, count, buf pointer, and file descriptor. A system call (trap( )) enters the kernel, where fd indexes u_ofile[] (slots 0 1 2 3 4) in user, leading to the (system) file table (offset), the inode table (/, a, b), and the data blocks]
When the local buffer (in FILE) becomes empty, the read() system call fills this buffer again (write() is the counterpart for output)
Example: open
1. my a.out calls library fopen("/a/b/c")
2. fopen() creates struct FILE for /a/b/c
3. library invokes system call open("/a/b/c")
   kernel sets up kernel tables (inode, user, .., u_ofile[])
   kernel returns file descriptor fd
   fopen() saves fd in FILE (for future use)
   fopen() returns FILE *
4. my a.out saves FILE * (for future use)
5. all future use goes through FILE *, e.g., getc(fp)
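That the library keeps a kernel file descriptor inside FILE can be checked with the POSIX fileno() call. This is a small sketch: the function name fd_inside_FILE is invented, and tmpfile() stands in for fopen() so that no particular pathname has to exist.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>

/* Opens a temporary file through the C library and returns the kernel
 * file descriptor stored inside its FILE, or -1 on failure. */
int fd_inside_FILE(void)
{
    FILE *fp = tmpfile();     /* library call; opens a file underneath */
    if (fp == NULL)
        return -1;
    int fd = fileno(fp);      /* POSIX: the descriptor saved in *fp */
    assert(fd >= 0);
    fclose(fp);               /* also closes the underlying descriptor */
    return fd;
}
```

The returned number is the same kind of small integer that open() returns, so every library I/O call on fp ends up as read() or write() on that descriptor.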
Example: getchar()
#include "syscalls.h"

int getchar(void)   /* library function -- copied into my a.out */
{
    /* data structure in library */
    static char buf[BUFSIZ];   /* library local buffer */
    static char *bufp = buf;   /* pointer */
    static int n = 0;          /* counter */

    /* Is library local buffer empty? */
    if (n == 0) {   /* Yes, invoke read() system call & fill up local buffer */
        n = read(0, buf, sizeof(buf));   /* system call */
        bufp = buf;
    }
    /* return a character; the test is >= 0 so the last buffered character is not dropped */
    return (--n >= 0) ? (unsigned char) *bufp++ : EOF;
}
Functions for file handling
So, you usually use library…
printf() for formatting (such as %s, %d)
getchar() for performance ….
But all library I/O functions end up asking system call
(Library functions are "user" code & cannot do I/O directly)
They are front-end and provide you with convenience, performance …
Many library functions may exist
But there's only one system call for read()
Summary
File System:
Transforms blocks into Files and Directories
Optimize for access and usage patterns
Maximize sequential access, allow efficient random access
File (and directory) defined by header, called “inode”
File Allocation Table (FAT) Scheme
Linked-list approach
Very widely used: Cameras, USB drives, SD cards
Simple to implement, but poor performance and no security
Look at actual file access patterns – many small files, but large files take up all the space!
Q&A