1. Overview:
No interrupts, no devices, no I/O.
Tasks are goroutines.
2. syscall:
The sentry can run in non-root mode (ring 0) and in root mode (ring 3).
The user app's syscalls are intercepted as in a normal guest and handled by the sentry kernel (in non-root mode iff the syscall can be
handled without the sentry issuing a syscall on the host).
The sentry kernel's own syscalls are always executed in root (ring 3) mode. A sentry kernel syscall ends up executing HLT, which causes
a VM exit, after which execution continues in root mode.
basic flow:
bluepill loops until bluepillHandler + setcontext return; execution is now in non-root mode.
user app -> syscall -> sysenter -> user: -> SwitchToUser returns -> syscall handled by sentry kernel (t.doSyscall()), in non-root mode.
sentry kernel -> syscall -> sysenter -> HLT -> vm exit -> t.doSyscall() [in root mode].
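The two paths above can be pictured with a toy model. This is only a sketch of the notes' description, not gVisor code: the `mode`, `request`, and `handle` names are hypothetical, and whether a request "needs the host" is just a flag set by the caller.

```go
package main

import "fmt"

// mode models the CPU mode the sentry is running in.
type mode int

const (
	nonRoot mode = iota // guest mode (ring 0 inside the VM)
	root                // host mode (ring 3 on the host)
)

func (m mode) String() string {
	if m == nonRoot {
		return "non-root"
	}
	return "root"
}

// request is a hypothetical syscall issued by the user app.
type request struct {
	name      string
	needsHost bool // true if the sentry must issue a host syscall to serve it
}

// handle returns the modes the sentry passes through while serving req.
func handle(req request) []mode {
	// A user app syscall is first intercepted by the sentry in non-root mode.
	trace := []mode{nonRoot}
	if req.needsHost {
		// The sentry's own syscall ends in HLT, which forces a VM exit,
		// so the host syscall is made in root mode.
		trace = append(trace, root)
	}
	return trace
}

func main() {
	fmt.Println(handle(request{name: "served-in-sentry", needsHost: false}))
	fmt.Println(handle(request{name: "needs-host-syscall", needsHost: true}))
}
```

The model deliberately leaves out how and when the guest is re-entered; it only captures which mode each kind of syscall is handled in.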
3. memory:
physical memory:
The size of physical memory is almost 1 << cpu.physicalbits, but may be smaller because of reserved regions, etc.
The vsize - psize part is not in physicalRegion. gva <-> gpa, i.e. the guest pagetable, maps almost all gva <-> gpa, but gpa <-> hva (hpa)
is set up only for the sentry kernel initially. After that, a gpa page frame is filled by HandleUserFault (from filemem or HostFile) each
time there is an EPT fault.
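The on-demand filling described above can be sketched with a toy model. Everything here is an assumption for illustration: `guestMemory`, `load`, and `handleFault` are hypothetical names, a Go map stands in for the EPT, and a zeroed byte slice stands in for a page coming from filemem or a host file.

```go
package main

import "fmt"

const pageSize = 4096

// guestMemory is a toy stand-in for the gpa -> backing page mapping.
type guestMemory struct {
	physicalBits uint
	frames       map[uint64][]byte // keyed by page-aligned gpa
	faults       int               // number of simulated EPT faults
}

func newGuestMemory(physicalBits uint) *guestMemory {
	return &guestMemory{physicalBits: physicalBits, frames: make(map[uint64][]byte)}
}

// size is the upper bound of the guest physical address space.
func (m *guestMemory) size() uint64 {
	return uint64(1) << m.physicalBits
}

// handleFault plays the role of the fault handler: it backs the page at base.
func (m *guestMemory) handleFault(base uint64) []byte {
	m.faults++
	page := make([]byte, pageSize)
	m.frames[base] = page
	return page
}

// load reads one byte at gpa, backing the page on first touch.
func (m *guestMemory) load(gpa uint64) (byte, error) {
	if gpa >= m.size() {
		return 0, fmt.Errorf("gpa %#x is outside the physical region", gpa)
	}
	base := gpa &^ (pageSize - 1)
	page, ok := m.frames[base]
	if !ok {
		page = m.handleFault(base)
	}
	return page[gpa-base], nil
}

func main() {
	m := newGuestMemory(36)
	for _, gpa := range []uint64{0x1000, 0x1008, 0x5000} {
		if _, err := m.load(gpa); err != nil {
			fmt.Println("load failed:", err)
			return
		}
	}
	fmt.Println("simulated faults:", m.faults)
}
```

Only the first touch of each page takes the fault path; later accesses to the same page find it already backed.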
pagetables:
gVisor itself is mapped in both root and non-root mode, with gva == hva. So the sentry runs in a userspace address space
in root ring 3 mode, and also runs in a userspace address space in non-root ring 0 mode.
user app: userspace address space (lower part of the 64-bit address space) <--> gpa.
kernelspace address space (higher part of the 64-bit address space), which is actually
the sentry kernel's userspace address with bit 63 set <--> gpa. This
mapping is almost unused, perhaps only for the pagetable switch and some setup.
The sentry cannot run in this address range (even
position-independent code would not work, since PIC addresses are resolved once, not on every
access).
sentry kernel: userspace address space, which is the userspace address on the host.
So gva actually equals hva, and then gva <-> gpa <-> hva.
kernelspace address space is hva with bit 63 set <--> gpa. gpa <--> hva (hpa)
is set up using EPT. Again, gpa <--> hva is set up for the sentry kernel initially. All subsequent mappings
are handled by EPT faults, which eventually call HandleUserFault().
From this we can see that every user app syscall involves a pagetable switch,
somewhat similar to KPTI, although the pagetables are very different.
Since the user app's and the sentry kernel's pagetables probably overlap (they use the same userspace address range), they cannot be
mapped at the same time. On a syscall, execution switches to the sentry kernel's pagetable, which
has no mapping for the user app. This makes access to user memory complicated
(this is why usermem is needed). In Linux, by contrast, the kernel's pagetable is a superset
of the user process's pagetable, so the kernel can access user memory conveniently.
The sentry kernel needs to access the user app's memory (for example, for a write syscall from the user app, the sentry kernel
has to copy data out of the user app's address space). How does it find the sentry kernel address that corresponds to a user app
address? Basically, either walk the user app's pagetable to get uaddr -> gpa, or walk the user app's vmas to find
uaddr -> file + file offset, then walk the user app's address_space to find file + file offset -> gpa. The sentry then
knows gpa -> hva (it maps all the memory itself and stores the mapping) and gets the hva. In the sentry, gva == hva, so whether the
sentry is in root or non-root mode, it can access this hva.
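The first lookup path (uaddr -> gpa -> hva) can be sketched as two table lookups. This is a simplified model, not gVisor's usermem code: `addressSpace`, `userToSentry`, and the two maps are hypothetical, with one map standing in for the user app's pagetable and the other for the sentry's own gpa -> hva bookkeeping.

```go
package main

import (
	"errors"
	"fmt"
)

const pageSize = 4096

var errNotMapped = errors.New("address not mapped")

// addressSpace holds the two page-granular tables used for translation.
type addressSpace struct {
	userPT    map[uint64]uint64 // user pagetable: uaddr page -> gpa page
	sentryMap map[uint64]uint64 // sentry bookkeeping: gpa page -> hva page
}

// userToSentry translates a user app address into the address the sentry
// can dereference (hva, which equals the sentry's gva).
func (as *addressSpace) userToSentry(uaddr uint64) (uint64, error) {
	off := uaddr & (pageSize - 1)
	gpa, ok := as.userPT[uaddr-off]
	if !ok {
		return 0, fmt.Errorf("uaddr %#x: %w", uaddr, errNotMapped)
	}
	hva, ok := as.sentryMap[gpa]
	if !ok {
		return 0, fmt.Errorf("gpa %#x: %w", gpa, errNotMapped)
	}
	return hva + off, nil
}

func main() {
	as := &addressSpace{
		userPT:    map[uint64]uint64{0x400000: 0x10000},
		sentryMap: map[uint64]uint64{0x10000: 0x7f0000001000},
	}
	hva, err := as.userToSentry(0x400123)
	if err != nil {
		fmt.Println("translate failed:", err)
		return
	}
	fmt.Printf("uaddr 0x400123 -> hva %#x\n", hva)
}
```

The page offset is carried through unchanged; only the page number is translated at each step.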
4. Filesystem:
A thin VFS lives in the sentry, like in Linux. It also has limited proc and sys. The gofer is only for 9pfs.
From the code path, all file operations go through the 9p server. However, from the log, there are no Tread/Twrite messages in
the 9p server; only Topen/Tclunk go through it. So presumably
reads and writes go directly to the host file, probably through an fd passed over a unix domain socket.
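The fd-passing guess above relies on a standard Unix mechanism, SCM_RIGHTS. The sketch below shows that mechanism in isolation using Go's `syscall` package on Linux. It is not gofer code; `sendFD`, `recvFD`, and `roundTrip` are hypothetical helpers, and a temporary file stands in for the host file.

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// sendFD passes fd to the peer of sock as an SCM_RIGHTS control message.
func sendFD(sock, fd int) error {
	// One byte of ordinary data accompanies the control message.
	return syscall.Sendmsg(sock, []byte{0}, syscall.UnixRights(fd), nil, 0)
}

// recvFD receives exactly one file descriptor from sock.
func recvFD(sock int) (int, error) {
	buf := make([]byte, 1)
	oob := make([]byte, syscall.CmsgSpace(4)) // room for one 32-bit fd
	_, oobn, _, _, err := syscall.Recvmsg(sock, buf, oob, 0)
	if err != nil {
		return -1, err
	}
	msgs, err := syscall.ParseSocketControlMessage(oob[:oobn])
	if err != nil {
		return -1, err
	}
	if len(msgs) != 1 {
		return -1, fmt.Errorf("expected 1 control message, got %d", len(msgs))
	}
	fds, err := syscall.ParseUnixRights(&msgs[0])
	if err != nil {
		return -1, err
	}
	if len(fds) != 1 {
		return -1, fmt.Errorf("expected 1 fd, got %d", len(fds))
	}
	return fds[0], nil
}

// roundTrip writes content to a temp file, passes its fd over a socketpair,
// and reads the content back through the received fd.
func roundTrip(content string) (string, error) {
	pair, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM, 0)
	if err != nil {
		return "", err
	}
	defer syscall.Close(pair[0])
	defer syscall.Close(pair[1])

	f, err := os.CreateTemp("", "fd-passing-demo")
	if err != nil {
		return "", err
	}
	defer os.Remove(f.Name())
	defer f.Close()
	if _, err := f.WriteString(content); err != nil {
		return "", err
	}

	if err := sendFD(pair[0], int(f.Fd())); err != nil {
		return "", err
	}
	fd, err := recvFD(pair[1])
	if err != nil {
		return "", err
	}
	received := os.NewFile(uintptr(fd), "received")
	defer received.Close()

	// The received fd shares the file offset with the sender's fd,
	// so read at an explicit offset instead of the current position.
	buf := make([]byte, len(content))
	if _, err := received.ReadAt(buf, 0); err != nil {
		return "", err
	}
	return string(buf), nil
}

func main() {
	got, err := roundTrip("hello from the host file")
	if err != nil {
		fmt.Println("round trip failed:", err)
		return
	}
	fmt.Println(got)
}
```

Once the receiver holds the fd, it reads and writes the file with ordinary syscalls and no further messages to the sender, which would be consistent with seeing only open/close traffic in a log.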
5. Network:
Receive via a goroutine, tx via endpoint.WritePacket.
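The split between a receive goroutine and a direct transmit call can be sketched as below. This is a loopback toy, not gVisor's netstack: only the `WritePacket` name comes from the notes, and the `endpoint` type, its channel "wire", and `dispatchLoop` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// endpoint is a toy link endpoint whose wire loops back to itself.
type endpoint struct {
	wire     chan []byte // stands in for the host-side device
	wg       sync.WaitGroup
	mu       sync.Mutex
	received [][]byte
}

// newEndpoint starts the receive goroutine and returns the endpoint.
func newEndpoint() *endpoint {
	e := &endpoint{wire: make(chan []byte, 16)}
	e.wg.Add(1)
	go e.dispatchLoop()
	return e
}

// dispatchLoop is the receive path: it runs in its own goroutine and
// delivers every packet that arrives on the wire.
func (e *endpoint) dispatchLoop() {
	defer e.wg.Done()
	for pkt := range e.wire {
		e.mu.Lock()
		e.received = append(e.received, pkt)
		e.mu.Unlock()
	}
}

// WritePacket is the transmit path: it runs in the caller's goroutine.
func (e *endpoint) WritePacket(pkt []byte) {
	e.wire <- pkt
}

// close shuts the wire and waits for the receive goroutine to drain it.
func (e *endpoint) close() {
	close(e.wire)
	e.wg.Wait()
}

func main() {
	e := newEndpoint()
	e.WritePacket([]byte("ping"))
	e.WritePacket([]byte("pong"))
	e.close()
	fmt.Println("packets received:", len(e.received))
}
```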
6. Summary:
shortcomings: compatibility, instability, syscall overhead. E.g., the mount command causes gVisor to exit suddenly, the ip command
cannot run, and the SO_SNDBUF socket option is not supported.
merits: small memory footprint. Physical memory is backed by a memfd/physical file (somewhat like DAX). Memory is mapped on
demand, not fixed from the beginning.
Original: https://www.cnblogs.com/dream397/p/14270544.html