[Lec 1] Introduction & Examples

A good OS reconciles the following goals, which pull against each other:

  • efficient & well-abstracted
  • powerful & simple api
  • flexible & secure

Always check the return value of a system call to detect errors.

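fd = open("filename", 0) returns the lowest-numbered file descriptor that is not currently in use. 0 is the default open mode (O_RDONLY, which is 0 in xv6 and on Linux); the other flags include O_CREAT, O_APPEND, O_WRONLY and O_RDWR, and they can be combined with |, e.g. O_CREAT|O_RDWR.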
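A minimal sketch of the two points above (check every return value; open hands out the lowest unused descriptor). It assumes a POSIX host rather than xv6's user library, and the helper names open_checked and reopen_reuses_fd are made up for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Open `path` with `flags`; on failure report errno and return -1. */
int open_checked(const char *path, int flags) {
    assert(path != NULL);
    int fd = open(path, flags, 0644); /* the mode is only used with O_CREAT */
    if (fd < 0)
        fprintf(stderr, "open(%s): %s\n", path, strerror(errno));
    return fd;
}

/* Returns 1 if closing a descriptor and opening again yields the same
 * number, i.e. open() picked the lowest unused descriptor; -1 on error.
 * Assumes no other thread opens files in between. */
int reopen_reuses_fd(const char *path) {
    int first = open_checked(path, O_RDONLY);
    if (first < 0)
        return -1;
    if (close(first) < 0)
        return -1;
    int second = open_checked(path, O_RDONLY);
    if (second < 0)
        return -1;
    int same = (second == first);
    close(second);
    return same;
}
```

Calling reopen_reuses_fd("/dev/null") should return 1 on a typical Linux system, while open_checked on a path that does not exist returns -1 and prints the reason.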

pidof: given a program name, prints the pid of its running process(es)

(in real unix) /proc/pid_of_process/ holds kernel-level information about the process; the links in its fd directory are the process's file descriptors and the objects they point to.

inode

An inode is how unix stores the information about a file on disk. Use stat to see the inode that a file name maps to.

$ stat README.md
File: README.md
Size: 113 Blocks: 8 IO Block: 4096 regular file
Device: 820h/2080d Inode: 434332 Links: 1
Access: (0755/-rwxr-xr-x) Uid: ( 1000/ xing) Gid: ( 1000/ xing)
Access: 2022-02-22 23:20:21.504782371 +0800
Modify: 2022-02-22 21:09:04.114819990 +0800
Change: 2022-02-22 21:09:04.114819990 +0800
Birth: -

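ln srcfile destfile creates a hard link: destfile gets the same inode number as srcfile, and in the file system both names are nothing more than links to the same on-disk file. When the number of links to an inode drops to 0 (and no process still has the file open), the operating system removes the file from disk.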
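The link count can be observed from a program as well. The sketch below assumes a POSIX host with a writable /tmp that supports hard links; the function name hard_link_shares_inode and the temporary file names are made up for illustration.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Creates a temp file plus a hard link to it and checks that both names
 * share one inode and that the link count goes 2 -> 1 after unlink().
 * Returns 1 if everything matched, 0 if not, -1 on setup failure. */
int hard_link_shares_inode(void) {
    char src[] = "/tmp/hardlink_demo_XXXXXX";
    int fd = mkstemp(src); /* creates the file, link count 1 */
    if (fd < 0)
        return -1;
    close(fd);

    char dst[64];
    int n = snprintf(dst, sizeof dst, "%s.link", src);
    assert(n > 0 && (size_t)n < sizeof dst);

    if (link(src, dst) < 0) { /* same effect as: ln src dst */
        unlink(src);
        return -1;
    }

    struct stat a, b, c;
    int ok = stat(src, &a) == 0 && stat(dst, &b) == 0 &&
             a.st_ino == b.st_ino && a.st_nlink == 2 && b.st_nlink == 2;

    unlink(dst); /* drops one link; the file itself survives */
    ok = ok && stat(src, &c) == 0 && c.st_nlink == 1;

    unlink(src); /* last link gone: the file can be reclaimed */
    return ok ? 1 : 0;
}
```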

The inode design is essentially an abstraction of the file, a form of isolation.

[Lec 3] OS Organization

Kernel mode vs. user mode: distinguished by a flag in the CPU; flag=0 is kernel mode, flag=1 is user mode

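An application selects a system call by placing its number n in register a7 and executing ecall (written ecall <n> in these notes); execution then enters the kernel and continues there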

kernel/syscall.h defines the mapping between n and the system calls

Use riscv64-linux-gnu-objdump -d exename to view the assembly of exename

The kernel uses timer interrupts to switch the process running on a CPU

kernel = trusted computing base

  • kernel must be bug-free
  • kernel must treat all processes as malicious

in kernel/syscall.c
With designated initializers (ISO C99), individual array members can be initialized explicitly by index.
The compiler can also deduce the length of the array if the square brackets are left empty.

int array[5] = {[2] = 5, [1] = 2, [4] = 9};
/* array is {0, 2, 5, 0, 9} */
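The same feature is what makes a table indexed by system call number easy to write. The following is a sketch in the spirit of the table in kernel/syscall.c, not a copy of it: the numbers SYS_demo_getpid and SYS_demo_uptime, the handlers, and dispatch are hypothetical stand-ins.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical syscall numbers, standing in for those in kernel/syscall.h. */
#define SYS_demo_getpid 1
#define SYS_demo_uptime 3

static int sys_demo_getpid(void) { return 42; }
static int sys_demo_uptime(void) { return 1000; }

/* Empty brackets: the compiler sizes the array from the largest
 * designator (3), so it has 4 slots. Slots without a designator
 * (0 and 2) are implicitly NULL. */
static int (*syscalls[])(void) = {
    [SYS_demo_getpid] = sys_demo_getpid,
    [SYS_demo_uptime] = sys_demo_uptime,
};

#define NSYSCALLS (sizeof(syscalls) / sizeof(syscalls[0]))

size_t syscall_table_len(void) { return NSYSCALLS; }

/* Look up handler n and call it; -1 for unknown or unassigned numbers. */
int dispatch(int n) {
    if (n > 0 && (size_t)n < NSYSCALLS && syscalls[n] != NULL)
        return syscalls[n]();
    return -1;
}
```

A quick check of the behaviour: assert(dispatch(SYS_demo_getpid) == 42) holds, and dispatch(2) returns -1 because slot 2 was never assigned.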

qemu-system-riscv64: terminating on signal 15 from pid xx (make) means that make received a termination request and in turn sent the termination signal (15, SIGTERM) to the emulator

[Lec 4] Page Tables

each process has its own memory map va->pa
the map is stored in memory
the MMU reads the map from memory and does the translation
the satp register in the CPU points to where the page table is stored
satp is maintained by the kernel and can only be modified by the kernel

Index(27b) + offset(12b) -> PPN(44b) + offset(12b)

*PPN: physical page number

satp -> address of top level page table

Index -> 9b + 9b + 9b
-> the first 9b is the key (index) into the top level page table; the entries in a table are called PTEs
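The 9 + 9 + 9 + 12 split can be written as a few shifts and masks. This sketch is modeled on the PX-style macros used in xv6, but the function names va_index and va_offset are made up here.

```c
#include <assert.h>
#include <stdint.h>

#define PGSHIFT 12   /* bits of page offset */
#define PXMASK 0x1FF /* 9 bits per level */

/* Index into the page table at `level` (2 = top level, 0 = last level). */
unsigned va_index(uint64_t va, int level) {
    assert(level >= 0 && level <= 2);
    return (unsigned)((va >> (PGSHIFT + 9 * level)) & PXMASK);
}

/* Low 12 bits: byte offset inside the page, copied unchanged into the PA. */
unsigned va_offset(uint64_t va) {
    return (unsigned)(va & ((1u << PGSHIFT) - 1));
}
```

For the address (5 << 30) | (6 << 21) | (7 << 12) | 0x123, the three indices are 5, 6 and 7 from top level to last level, and the offset is 0x123.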

top level page table

the table's size is $2^9 \times (64/8) = 2^{12} = 4096$ bytes, so it occupies exactly one page
each PTE holds the PPN of the next level page table and takes 64b
the PPN is a physical address: a translation service cannot depend on another translation service
page tables are page-aligned, so the last 12b of the next table's address are always 0 and are not stored; instead the lowest 10 bits of the PTE are used for translation control

from high(9) to low(0) are:

  • RSW(2): reserved for supervisor software
  • D(1) : dirty
  • A(1) : Accessed
  • G(1) : Global
  • U(1) : User - accessible by process running in user space
  • X(1) : Executable - executing instructions from it allowed
  • W(1) : Writable - writing to page allowed
  • R(1) : Readable - reading from page allowed
  • V(1) : Valid - it's a valid PTE for translation
    if V is clear, the PTE has not been assigned yet and cannot be used
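Putting the PTE layout and the three levels together, a page-table walk can be simulated on a host machine. This is a toy model under stated assumptions: physical memory is a small array of pages (phys), install_demo_mapping builds one hypothetical mapping, and every valid upper-level PTE is treated as a pointer to the next table (superpages are ignored). The flag bits and the shift-by-10 / shift-by-12 conversions follow the layout described above.

```c
#include <assert.h>
#include <stdint.h>

#define PTE_V (1ULL << 0)
#define PTE_R (1ULL << 1)
#define PTE_W (1ULL << 2)
#define PTE_X (1ULL << 3)
#define PTE_U (1ULL << 4)

/* The PPN sits above the 10 flag bits; a physical address has 12 offset bits. */
#define PA2PTE(pa) ((((uint64_t)(pa)) >> 12) << 10)
#define PTE2PA(pte) ((((uint64_t)(pte)) >> 10) << 12)
#define PX(level, va) ((((uint64_t)(va)) >> (12 + 9 * (level))) & 0x1FF)

#define NPAGES 8
#define NO_MAPPING UINT64_MAX

/* Simulated physical memory: page i lives at "physical address" i << 12. */
static uint64_t phys[NPAGES][512];

static uint64_t *page_at(uint64_t pa) {
    assert((pa >> 12) < NPAGES);
    return phys[pa >> 12];
}

/* Walk the three levels starting at root_pa (what satp would point to). */
uint64_t translate(uint64_t root_pa, uint64_t va) {
    uint64_t pa = root_pa;
    for (int level = 2; level >= 0; level--) {
        uint64_t pte = page_at(pa)[PX(level, va)];
        if (!(pte & PTE_V))
            return NO_MAPPING; /* hardware would raise a page fault */
        pa = PTE2PA(pte);
    }
    return pa | (va & 0xFFF); /* final PPN plus the unchanged offset */
}

/* Map one VA: root = page 0, middle = page 1, last = page 2, data = page 5. */
uint64_t install_demo_mapping(void) {
    uint64_t va = (1ULL << 30) | (2ULL << 21) | (3ULL << 12) | 0x2A;
    phys[0][1] = PA2PTE(1 << 12) | PTE_V;
    phys[1][2] = PA2PTE(2 << 12) | PTE_V;
    phys[2][3] = PA2PTE(5 << 12) | PTE_V | PTE_R | PTE_W | PTE_U;
    return va;
}
```

After install_demo_mapping(), translating the returned address from root 0 yields 0x502A (page 5, offset 0x2A), while translating address 0 yields NO_MAPPING because the first PTE on that path has V clear.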

TLB: translation lookaside buffer

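avoids the three memory accesses that a page-table walk needs for every translation
caches [VA,PA] mappings
When switching to another process's page table, the OS tells the TLB to flush itself (using sfence_vma(), which executes the sfence.vma instruction). On other occasions the OS and the TLB don't communicate (the CPU communicates with the TLB).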
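The caching and flushing behaviour can be illustrated with a toy software model. A real TLB is hardware and is organized differently; the direct-mapped layout, the size TLB_ENTRIES, and the names tlb_insert, tlb_lookup and tlb_flush are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TLB_ENTRIES 16
#define TLB_MISS UINT64_MAX

struct tlb_entry {
    int valid;
    uint64_t vpn; /* virtual page number (va >> 12) */
    uint64_t ppn; /* physical page number (pa >> 12) */
};

static struct tlb_entry tlb[TLB_ENTRIES];

/* Remember the translation of va's page, as done after a page-table walk. */
void tlb_insert(uint64_t va, uint64_t pa) {
    uint64_t vpn = va >> 12;
    struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    e->valid = 1;
    e->vpn = vpn;
    e->ppn = pa >> 12;
}

/* Return the cached physical address for va, or TLB_MISS. */
uint64_t tlb_lookup(uint64_t va) {
    uint64_t vpn = va >> 12;
    const struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
    if (e->valid && e->vpn == vpn)
        return (e->ppn << 12) | (va & 0xFFF);
    return TLB_MISS;
}

/* What the OS requests on a page-table switch: forget everything. */
void tlb_flush(void) {
    memset(tlb, 0, sizeof tlb);
}
```

For example, after tlb_insert(0x1234, 0x80005000) a lookup of 0x1FFF hits and returns 0x80005FFF (same page, different offset); after tlb_flush() the same lookup misses, which can be checked with assert(tlb_lookup(0x1FFF) == TLB_MISS).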

Kernel lives in [KERNBASE,PHYSTOP)=[0x80000000,?)

the kernel sets up the first page table; to keep things simple, this page table is mostly an identity mapping (VA==PA), and over KERNBASE~PHYSTOP the mapping is exactly the identity

PA >= 0x80000000: index to DRAM
PA < 0x80000000: memory-mapped I/O, i.e. communication with other hardware on the board

[Lec 5] RISC-V Calling Conventions

GDB

  • tui enable
  • layout {split,asm,src}
  • p $pc
  • p /x *argv@argc
  • x /2c $a1 (print 2 characters)
  • x /6i 0x3.... (print 6 instructions)
  • info reg
  • apropos cmd: search the manual for cmd

<Ctrl-a c>: goes into qemu console

  • info mem: print the page table specified by satp

[Lec 6] Trap

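On the transition from user mode to kernel mode, the kernel saves all 32 user registers as they are.
Supervisor mode can:

  • read and write control registers
    SATP, STVEC, SEPC, SSCRATCH
  • use PTEs that do not have the PTE_U flag set

It cannot:

  • read or write arbitrary physical addresses (it must go through PTEs)
  • use PTEs that have the PTE_U flag set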

Trap flow:

  • ecall (does not switch page tables; switches to supervisor mode; saves the user [pc] in [sepc])
  • uservec (assembly function in the trampoline)
  • usertrap() (saves [sepc] in the trapframe)
  • syscall()
  • sys_xxx()
  • syscall()
  • usertrapret()
  • userret
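The flow above can be traced with a small host-side simulation of how pc, sepc and the trapframe change hands. This is a simplified sketch, not xv6 code: struct hart, the *_sim functions, run_syscall and SYS_demo_getpid are hypothetical, page-table switching and the other registers are left out, and the "+ 4" reflects that a return from a system call resumes at the instruction after the 4-byte ecall.

```c
#include <assert.h>
#include <stdint.h>

#define SYS_demo_getpid 1 /* hypothetical syscall number */

struct hart { uint64_t pc, sepc, a0, a7; int supervisor; };
struct trapframe { uint64_t epc, a0, a7; };

/* ecall (hardware): save the user pc in sepc and enter supervisor mode. */
void do_ecall(struct hart *h) {
    h->sepc = h->pc;
    h->supervisor = 1;
}

/* uservec + usertrap: save user registers and sepc into the trapframe. */
void usertrap_sim(const struct hart *h, struct trapframe *tf) {
    assert(h->supervisor);
    tf->a0 = h->a0;
    tf->a7 = h->a7;
    tf->epc = h->sepc + 4; /* resume after the ecall instruction */
}

/* syscall: dispatch on a7, leave the return value in the saved a0. */
void syscall_sim(struct trapframe *tf) {
    tf->a0 = (tf->a7 == SYS_demo_getpid) ? 42 : (uint64_t)-1;
}

/* usertrapret + userret: restore sepc and registers, drop back to user mode. */
void usertrapret_sim(struct hart *h, const struct trapframe *tf) {
    h->sepc = tf->epc;
    h->a0 = tf->a0;
    h->pc = h->sepc; /* what the final sret does */
    h->supervisor = 0;
}

/* Run one system call issued by an ecall located at user_pc. */
struct hart run_syscall(uint64_t user_pc, uint64_t num) {
    struct hart h = { .pc = user_pc, .a7 = num };
    struct trapframe tf = { 0 };
    do_ecall(&h);
    usertrap_sim(&h, &tf);
    syscall_sim(&tf);
    usertrapret_sim(&h, &tf);
    return h;
}
```

With these definitions, run_syscall(0x100, SYS_demo_getpid) ends in user mode with a0 equal to 42 and pc equal to 0x104.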

[Lec 8] Page Faults

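Page mappings:

  • static: the kernel manipulates page tables at boot and on fork
  • dynamic: page faults

What information does the kernel need to respond to a page fault?

  1. The faulting virtual address
    page faults use the trap mechanism; the virtual address whose access failed is written to [stval]
  2. The type of the page fault (reported in [scause])
  3. The virtual address of the instruction that caused the page fault; it is held in [sepc] and then saved in trapframe::epc.
    After the fault is repaired, that instruction must be re-executed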
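The three pieces of information map onto three registers a handler would read. The sketch below assumes the RISC-V exception codes for page faults (12, 13 and 15 in scause); the function names page_fault_kind and fault_page are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PGSIZE 4096

/* Classify the scause value of a trap (RISC-V exception codes). */
const char *page_fault_kind(uint64_t scause) {
    switch (scause) {
    case 12: return "instruction page fault";
    case 13: return "load page fault";
    case 15: return "store/AMO page fault";
    default: return "not a page fault";
    }
}

/* Round the faulting address from stval down to the start of its page;
 * this is the page a handler would have to map. */
uint64_t fault_page(uint64_t stval) {
    return stval & ~(uint64_t)(PGSIZE - 1);
}
```

For instance, a store to address 0x3abc that faults would be seen as scause 15 with stval 0x3abc, and fault_page(0x3abc) gives 0x3000; a quick check is assert(strcmp(page_fault_kind(15), "store/AMO page fault") == 0).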

sbrk(): eager allocation by default; the kernel allocates the physical memory immediately, at the moment sbrk() is called

lazy allocation

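Applications tend to request more memory than they actually need, so lazy allocation can be used: when the user calls sbrk(), the kernel only increases p->sz and maps nothing; when the user later touches one of these pages and causes a page fault, the trap handler allocates and maps the page.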
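A host-side simulation of this idea, with a single pretend process: lazy_sbrk only moves the size, and handle_page_fault maps a page the first time it is touched. Everything here is a stand-in (struct proc_sim, the mapped array instead of a real page table, and the function names are hypothetical), so it shows the bookkeeping rather than real kernel code.

```c
#include <assert.h>
#include <stdint.h>

#define PGSIZE 4096
#define MAXPAGES 16

/* One simulated process: its size and which of its pages are mapped. */
struct proc_sim {
    uint64_t sz;
    int mapped[MAXPAGES];
};

static struct proc_sim proc;

/* Lazy sbrk: grow sz only, map nothing. Returns the old size or -1. */
int64_t lazy_sbrk(uint64_t n) {
    if (proc.sz + n > (uint64_t)MAXPAGES * PGSIZE)
        return -1;
    uint64_t old = proc.sz;
    proc.sz += n;
    return (int64_t)old;
}

/* Page-fault handler: stval is the faulting virtual address.
 * Returns 0 if a page was mapped, -1 if the address is outside the
 * process (where a real kernel would kill the process). */
int handle_page_fault(uint64_t stval) {
    if (stval >= proc.sz)
        return -1;
    uint64_t page = stval / PGSIZE;
    assert(page < MAXPAGES);
    proc.mapped[page] = 1; /* stands in for allocating and mapping a page */
    return 0;
}

/* Number of pages that actually consume physical memory. */
int mapped_pages(void) {
    int n = 0;
    for (int i = 0; i < MAXPAGES; i++)
        n += proc.mapped[i];
    return n;
}
```

After lazy_sbrk(3 * PGSIZE) no page is mapped; touching address PGSIZE + 5 maps exactly one page, and a fault at 3 * PGSIZE (just past the end) is rejected.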

zero fill on demand

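A program's BSS segment holds the global variables that are initialized to 0. To save the cost of initializing them, all virtual pages of this segment are mapped to one shared, zeroed physical page, and that page is marked read-only; when the program tries to write, the resulting page fault lets the kernel allocate a new physical page and map it.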

copy-on-write fork

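Some programs call exec() right after fork(), which makes fork()'s complete copy of the parent process wasteful. Instead, the child's virtual addresses can be mapped directly onto the parent's physical pages, rather than allocating new physical pages, copying the contents of the parent's pages into them, and then mapping them.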
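A toy model of the sharing and of the usual follow-up, where the copy happens only when someone writes. The write-fault handling is the standard copy-on-write technique rather than something spelled out in these notes, and all names (frames, refcount, struct mapping, cow_fork, cow_write_fault, cow_write, cow_demo) are hypothetical; each address space is reduced to a single page.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PGSIZE 4096
#define NFRAMES 4

static uint8_t frames[NFRAMES][PGSIZE]; /* simulated physical pages */
static int refcount[NFRAMES];           /* how many mappings use each page */

struct mapping { int frame; int writable; }; /* a one-page address space */

static int alloc_frame(void) {
    for (int i = 0; i < NFRAMES; i++) {
        if (refcount[i] == 0) {
            refcount[i] = 1;
            memset(frames[i], 0, PGSIZE);
            return i;
        }
    }
    return -1;
}

/* fork: the child shares the parent's page; both become read-only. */
struct mapping cow_fork(struct mapping *parent) {
    parent->writable = 0;
    refcount[parent->frame]++;
    struct mapping child = { parent->frame, 0 };
    return child;
}

/* Write fault on a read-only page: copy only if the page is still shared. */
int cow_write_fault(struct mapping *m) {
    if (refcount[m->frame] > 1) {
        int f = alloc_frame();
        if (f < 0)
            return -1;
        memcpy(frames[f], frames[m->frame], PGSIZE);
        refcount[m->frame]--;
        m->frame = f;
    }
    m->writable = 1;
    return 0;
}

int cow_write(struct mapping *m, int off, uint8_t value) {
    if (!m->writable && cow_write_fault(m) < 0)
        return -1;
    frames[m->frame][off] = value;
    return 0;
}

/* Returns 1 if the page was shared after fork and the child's write
 * neither changed the parent's data nor stayed on the shared page. */
int cow_demo(void) {
    struct mapping parent = { alloc_frame(), 1 };
    assert(parent.frame >= 0);
    cow_write(&parent, 0, 7);
    struct mapping child = cow_fork(&parent);
    int shared = (child.frame == parent.frame);
    cow_write(&child, 0, 9);
    return shared && child.frame != parent.frame &&
           frames[parent.frame][0] == 7 && frames[child.frame][0] == 9;
}
```

If the child called exec() before writing anything, no page would ever be copied, which is exactly the saving the scheme is after.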