Virtual Memory in FreeBSD - Part 1


In computing, virtual memory is a memory management technique implemented using both hardware and software. It maps memory addresses used by a program, called virtual addresses, to physical addresses in computer memory. Definitions like this, however, say nothing about the code behind it.


Recently, I came across a very good blog post about virtual memory which explains the theory behind virtual memory and talks a bit about the code.

The Design and Implementation of the FreeBSD Operating System is an excellent book, and it is the book I am reading and following as I learn.


This post is about the code-level details of virtual memory, explaining how the process address space is mapped and managed.

In this post, I will only be talking about the user process address space, not the kernel address space. Both use the same data structures anyway.


Components of Process Address Space

  1. vm_map - Head of the list of vm_map_entry structures
  2. vm_pmap - The machine-dependent portion of the address space
  3. stats - Paging statistics
  4. vm_map_entry
  5. vnode/object
  6. vm_page


A vm_map_entry describes a piece of the address space: it holds the start and end virtual addresses of a contiguous region.

Each vm_map_entry has a pointer to a vnode or object. That could be a file or an executable; a swap object, if the entry maps anonymous memory; or a device, if it maps something like a frame buffer.

obj offset - The object offset says where in the vnode/object we should start referencing. For example, if the vnode/object is an executable file, we skip the header and start where the text segment begins.
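To make the arithmetic concrete, here is a minimal sketch (with pared-down types, not the kernel's actual definitions) of how an entry's offset turns a virtual address into a position inside its backing vnode/object:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t vm_offset_t;
typedef uint64_t vm_ooffset_t;

struct map_entry {                /* pared-down vm_map_entry */
    vm_offset_t  start, end;      /* virtual address range */
    vm_ooffset_t offset;          /* where the range begins in the object */
};

/* Translate a virtual address inside the entry into an object offset. */
static vm_ooffset_t
va_to_object_offset(const struct map_entry *e, vm_offset_t va)
{
    return e->offset + (va - e->start);
}

int
main(void)
{
    /* Text mapped at 0x400000, skipping a hypothetical 0x1000-byte header. */
    struct map_entry text = { 0x400000, 0x500000, 0x1000 };
    printf("object offset = 0x%jx\n",
        (uintmax_t)va_to_object_offset(&text, 0x400010));
    return 0;
}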

The second vm_map_entry
This block is for the initialized data area. It refers to the same vnode/object (the executable file) but at a different offset, where the initialized data resides in the executable. So the first and second vm_map_entry blocks refer to the same vnode/object (the executable) but to different sections (offsets) of the same file. If two processes share memory, both can refer to the same offset, so when one process changes something the other sees the change.

Third vm_map_entry
For anonymous memory such as uninitialized data and the stack, the entry points to a vnode/object that is a swap object. Initially, when the system comes up, no space is allocated for the swap object; a lazy approach is followed. The first time the swap object is touched, a zero-filled page is created and written to. Only when memory becomes scarce does the paging daemon come along, find space for the swap object on disk, and perform the paging.
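This lazy behaviour is easy to observe from userland. A small runnable example (ordinary mmap(2) usage, nothing FreeBSD-internal): the mapping below costs almost nothing to create, and physical pages appear only on first touch:

#include <sys/mman.h>
#include <stdio.h>

int
main(void)
{
    size_t len = 16 * 1024 * 1024;        /* 16 MB of anonymous memory */

    /* The kernel only records the mapping here; no pages exist yet. */
    char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        MAP_ANON | MAP_PRIVATE, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    p[0] = 1;        /* first touch: a fault allocates a zero-filled page */
    p[len - 1] = 1;  /* second fault, second page; the rest stay unallocated */
    munmap(p, len);
    return 0;
}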

vm_page
Each vm_page structure describes one physical page in the machine. With 4K pages, there is one vm_page structure for every 4K of physical memory, so a system with 100 pages has 100 vm_page structures. Since this bookkeeping can consume a lot of memory, we try to keep struct vm_page as small as possible.

Every vm_page is referenced by exactly one part (not the whole) of a vnode/object. Looking at a vm_page, it can effectively tell you: I am logical block number 100 (or wherever) of this executable (vnode/object).

Pages are logical containers. When one process goes out and another comes in, the same pages are reused for whichever sections of vnode/objects the new process references. So a vm_page can be referenced by multiple processes.

Now, suppose we want to change some values in the initialized data area. The page we would modify may be referenced by multiple processes, so instead we create a copy of the page in a shadow object and modify that copy.
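A toy model of that lookup, assuming a simplified object with a backing_object pointer like the real vm_object shown later (the slot array and the helper are invented for illustration): a read fault searches the shadow first and then falls through to the backing object; after a write fault, the copied page sits in the shadow and wins the search:

#include <stdint.h>
#include <stdio.h>

#define NSLOTS 8

/* Toy object: holds a "page" at index i when slot[i] != 0. */
struct object {
    struct object *backing_object;    /* object this one shadows */
    uint64_t backing_offset_idx;      /* page-index shift into the backer */
    int slot[NSLOTS];
};

/* Resolve a page index the way a read fault on a COW mapping would:
 * look in the shadow object first, then walk the backing chain. */
static struct object *
chain_lookup(struct object *obj, uint64_t pidx)
{
    while (obj != NULL) {
        if (pidx < NSLOTS && obj->slot[pidx])
            return obj;
        pidx += obj->backing_offset_idx;
        obj = obj->backing_object;
    }
    return NULL;
}

int
main(void)
{
    struct object file = { NULL, 0, { [2] = 1 } };  /* page 2 resident */
    struct object shadow = { &file, 0, { 0 } };     /* empty shadow */

    /* Unmodified page: found in the backing (file) object. */
    printf("%s\n", chain_lookup(&shadow, 2) == &file ? "backing" : "shadow");

    /* After a write fault, the copied page lives in the shadow. */
    shadow.slot[2] = 1;
    printf("%s\n", chain_lookup(&shadow, 2) == &shadow ? "shadow" : "backing");
    return 0;
}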

For each vm_page there is a data structure called a pvEntry. The pvEntry maps the page to its physical mapping in the vm_pmap area (the machine-dependent part).

We also maintain another data structure that tracks all pvEntries associated with a vnode/object:

vnode/object -> vm_page1 -> pvEntry1
             -> vm_page2 -> pvEntry2
             ...
             -> vm_pageN -> pvEntryN

vnode/object -> pvEntry1
             ...
             -> pvEntryN

Zero-filled memory / uninitialized data / swap objects = the third vm_map_entry's vnode/object



Code walk

Header files

  • vm header file - vm_param.h
  • vm header file - vm.h
  • vm header file - vm_map.h

VM protections
#define VM_PROT_NONE           ((vm_prot_t) 0x00)
#define VM_PROT_READ           ((vm_prot_t) 0x01)
#define VM_PROT_WRITE          ((vm_prot_t) 0x02)
#define VM_PROT_EXECUTE        ((vm_prot_t) 0x04)
#define VM_PROT_OVERRIDE_WRITE ((vm_prot_t) 0x08)  /* copy-on-write */
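These are bit flags combined with OR. A small illustrative check, in the spirit of the validation vm_mmap performs (the subset test is the idea; the surrounding code is invented):

#include <stdio.h>

typedef unsigned char vm_prot_t;   /* modeled as a small integer here */

#define VM_PROT_READ    ((vm_prot_t) 0x01)
#define VM_PROT_WRITE   ((vm_prot_t) 0x02)
#define VM_PROT_EXECUTE ((vm_prot_t) 0x04)

int
main(void)
{
    vm_prot_t prot = VM_PROT_READ | VM_PROT_WRITE;  /* what the caller asks for */
    vm_prot_t maxprot = VM_PROT_READ;               /* what the file allows */

    /* The requested protection must be a subset of the maximum. */
    if ((prot & maxprot) != prot)
        printf("request denied: write not permitted on this mapping\n");
    return 0;
}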


vmspace

struct vmspace {
     struct vm_map vm_map;     /* VM address map */
     struct shmmap_state *vm_shm;     /* SYS5 shared memory private data XXX */
     segsz_t vm_swrss;     /* resident set size before last swap */
     segsz_t vm_tsize;     /* text size (pages) XXX */
     segsz_t vm_dsize;     /* data size (pages) XXX */
     segsz_t vm_ssize;     /* stack size (pages) */
     caddr_t vm_taddr;     /* (c) user virtual address of text */
     caddr_t vm_daddr;     /* (c) user virtual address of data */
     caddr_t vm_maxsaddr;     /* user VA at max stack growth */
     int     vm_refcnt;     /* number of references */
     /*
      * Keep the PMAP last, so that CPU-specific variations of that
      * structure on a single architecture don't result in offset
      * variations of the machine-independent fields in the vmspace.
      */
     struct pmap vm_pmap;     /* private physical map */
};
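The segment sizes above are kept in pages, so byte counts come from shifting by the page size. A hedged sketch with a locally defined ptoa() standing in for the kernel's pages-to-bytes macro:

#include <stdio.h>

#define PAGE_SHIFT 12                              /* 4K pages */
#define ptoa(x) ((unsigned long)(x) << PAGE_SHIFT) /* pages-to-address */

/* Pared-down vmspace: just the segment accounting fields. */
struct vmspace_lite {
    long vm_tsize;    /* text size (pages) */
    long vm_dsize;    /* data size (pages) */
    long vm_ssize;    /* stack size (pages) */
};

int
main(void)
{
    struct vmspace_lite vm = { 25, 300, 4 };   /* invented sizes */
    printf("text %lu, data %lu, stack %lu bytes\n",
        ptoa(vm.vm_tsize), ptoa(vm.vm_dsize), ptoa(vm.vm_ssize));
    return 0;
}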


vm_map

struct vm_map {
     struct vm_map_entry header;     /* List of entries */
     struct sx lock;               /* Lock for map data */
     struct mtx system_mtx;
     int nentries;               /* Number of entries */
     vm_size_t size;               /* virtual size */
     u_int timestamp;          /* Version number */
     u_char needs_wakeup;
     u_char system_map;          /* (c) Am I a system map? */
     vm_flags_t flags;          /* flags for this vm_map */
     vm_map_entry_t root;          /* Root of a binary search tree */
     pmap_t pmap;               /* (c) Physical map */
     vm_map_entry_t deferred_freelist;
#define     min_offset     header.start     /* (c) */
#define     max_offset     header.end     /* (c) */
};
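The header entry is a sentinel: the entry list is circular, so iteration starts at header.next and stops when it wraps back around to the header. A toy sketch of that walk (simplified types, invented addresses):

#include <stdio.h>

/* Toy entry: just the list links and the address range. */
struct entry {
    struct entry *prev, *next;
    unsigned long start, end;
};

/* Iterate from header->next back around to the header sentinel. */
static void
walk_map(struct entry *header)
{
    for (struct entry *e = header->next; e != header; e = e->next)
        printf("entry [%#lx, %#lx)\n", e->start, e->end);
}

int
main(void)
{
    struct entry header = { 0 }, text = { 0 }, data = { 0 };

    text.start = 0x400000UL; text.end = 0x450000UL;
    data.start = 0x450000UL; data.end = 0x500000UL;

    /* Circular doubly-linked list: header <-> text <-> data <-> header. */
    header.next = &text; text.next = &data; data.next = &header;
    header.prev = &data; data.prev = &text; text.prev = &header;

    walk_map(&header);
    return 0;
}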


vm_map_entry

struct vm_map_entry {
     /* Entries are kept both in a doubly-linked list and in a binary search tree */
     struct vm_map_entry *prev;     /* previous entry */
     struct vm_map_entry *next;     /* next entry */
     struct vm_map_entry *left;     /* left child in binary search tree */
     struct vm_map_entry *right;     /* right child in binary search tree */

     vm_offset_t start;          /* start address */
     vm_offset_t end;          /* end address */
     vm_offset_t avail_ssize;     /* amt can grow if this is a stack */

     vm_size_t adj_free;          /* amount of adjacent free space */
     vm_size_t max_free;          /* max free space in subtree */

     union vm_map_object object;     /* object I point to */
     vm_ooffset_t offset;          /* offset into object */

     vm_eflags_t eflags;          /* map entry flags */

     //Read, write, execute etc...
     vm_prot_t protection;          /* protection code */
     vm_prot_t max_protection;     /* maximum protection */
     vm_inherit_t inheritance;     /* inheritance */
     int wired_count;          /* can be paged if = 0 */
     vm_pindex_t lastr;          /* last read */
};
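The same entries also hang off a binary search tree keyed on their address ranges, so a lookup descends left or right until it lands inside an entry. A toy version of that search (the real vm_map_lookup_entry also splays the tree; this sketch shows only the descent):

#include <stddef.h>
#include <stdio.h>

/* Toy entry: tree links plus the address range. */
struct entry {
    struct entry *left, *right;
    unsigned long start, end;
};

/* Find the entry whose [start, end) range contains addr, if any. */
static struct entry *
lookup(struct entry *root, unsigned long addr)
{
    struct entry *e = root;
    while (e != NULL) {
        if (addr < e->start)
            e = e->left;
        else if (addr >= e->end)
            e = e->right;
        else
            return e;
    }
    return NULL;
}

int
main(void)
{
    struct entry text  = { NULL, NULL, 0x400000UL, 0x450000UL };
    struct entry data  = { NULL, NULL, 0x450000UL, 0x500000UL };
    struct entry stack = { NULL, NULL, 0x7f000000UL, 0x7f010000UL };

    stack.left = &text;    /* root: highest range */
    text.right = &data;

    struct entry *e = lookup(&stack, 0x460000UL);
    printf("%s\n", e == &data ? "hit: data entry" : "miss");
    return 0;
}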


VM object types

     enum obj_type { OBJT_DEFAULT, OBJT_SWAP, OBJT_VNODE, OBJT_DEVICE, OBJT_PHYS,
                                OBJT_DEAD };


vnode/object

struct vm_object {
     struct mtx mtx;
     TAILQ_ENTRY(vm_object) object_list; /* list of all objects */
     LIST_HEAD(, vm_object) shadow_head; /* objects that this is a shadow for */
     LIST_ENTRY(vm_object) shadow_list; /* chain of shadow objects */
     TAILQ_HEAD(, vm_page) memq;     /* list of resident pages */
     vm_page_t root;               /* root of the resident page splay tree */
     vm_pindex_t size;          /* Object size */
     int generation;               /* generation ID */
     int ref_count;               /* How many refs?? */
     int shadow_count;          /* how many objects that this is a shadow for */

     objtype_t type;               /* type of pager */
     u_short flags;               /* see below */
     u_short pg_color;          /* (c) color of first page in obj, i.e. superpage etc. */
     u_short paging_in_progress;     /* Paging (in or out) so don't collapse or destroy */
     int resident_page_count;     /* number of resident pages */
     struct vm_object *backing_object; /* object that I'm a shadow of */
     vm_ooffset_t backing_object_offset;/* Offset in backing object */

     TAILQ_ENTRY(vm_object) pager_object_list; /* list of all objects of this pager type */
     LIST_HEAD(, vm_reserv) rvq;     /* list of reservations */
     vm_page_t cache;          /* root of the cache page splay tree */
     void *handle;
     union {
          /*
           * VNode pager
           *
           *     vnp_size - current size of file
           */
          struct {
               off_t vnp_size;
          } vnp;

          /*
           * Device pager
           *
           *     devp_pglist - list of allocated pages
           */
          struct {
               TAILQ_HEAD(, vm_page) devp_pglist;
          } devp;

          /*
           * Swap pager
           *
           *     swp_bcount - number of swap 'swblock' metablocks, each
           *               contains up to 16 swapblk assignments.
           *               see vm/swap_pager.h
           */
          struct {
               int swp_bcount;
          } swp;
     } un_pager;
};
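The un_pager union at the end holds per-pager-type state, so only the arm matching type is meaningful. A small illustrative sketch of that pattern with a pared-down object (the values are invented):

#include <stdio.h>

typedef enum { OBJT_DEFAULT, OBJT_SWAP, OBJT_VNODE, OBJT_DEVICE } objtype_t;

/* Pared-down vm_object: type tag plus two of the union arms. */
struct object_lite {
    objtype_t type;
    union {
        struct { long vnp_size; } vnp;    /* vnode pager: current file size */
        struct { int swp_bcount; } swp;   /* swap pager: metablock count */
    } un_pager;
};

int
main(void)
{
    struct object_lite obj = { .type = OBJT_VNODE };
    obj.un_pager.vnp.vnp_size = 65536;

    /* Only consult the union arm that matches the object's type. */
    if (obj.type == OBJT_VNODE)
        printf("backing file is %ld bytes\n", obj.un_pager.vnp.vnp_size);
    return 0;
}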


vm_page (least-actively-used page replacement)

struct vm_page {
     TAILQ_ENTRY(vm_page) pageq;     /* queue info for FIFO queue or free list (P) */
     TAILQ_ENTRY(vm_page) listq;     /* pages in same object (O)      */
     struct vm_page *left;          /* splay tree link (O)          */
     struct vm_page *right;          /* splay tree link (O)          */

     vm_object_t object;          /* which object am I in (O,P)*/
     vm_pindex_t pindex;          /* offset into object (O,P) */
     vm_paddr_t phys_addr;          /* physical address of page */
     struct md_page md;          /* machine dependent stuff */
     uint8_t     queue;               /* page queue index */
     int8_t segind;
     u_short     flags;               /* see below */
     uint8_t     order;               /* index of the buddy queue */
     uint8_t pool;
     u_short cow;               /* page cow mapping count */
     u_int wire_count;          /* wired down maps refs (P) */
     short hold_count;          /* page hold count */
     u_short oflags;               /* page flags (O) */
     u_char     act_count;          /* page usage count, Page active count */
     u_char     busy;               /* page busy count (O) */
     /* NOTE that these must support one bit per DEV_BSIZE in a page!!! */
     /* so, on normal X86 kernels, they must be at least 8 bits wide */
#if PAGE_SIZE == 4096
     u_char     valid;               /* map of valid DEV_BSIZE chunks (O) */
     u_char     dirty;               /* map of dirty DEV_BSIZE chunks */
#elif PAGE_SIZE == 8192
     u_short     valid;               /* map of valid DEV_BSIZE chunks (O) */
     u_short     dirty;               /* map of dirty DEV_BSIZE chunks */
#elif PAGE_SIZE == 16384
     u_int valid;               /* map of valid DEV_BSIZE chunks (O) */
     u_int dirty;               /* map of dirty DEV_BSIZE chunks */
#elif PAGE_SIZE == 32768
     u_long valid;               /* map of valid DEV_BSIZE chunks (O) */
     u_long dirty;               /* map of dirty DEV_BSIZE chunks */
#endif
};
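The #if ladder exists because valid and dirty keep one bit per DEV_BSIZE (512-byte) chunk of the page: a 4096-byte page needs 8 bits, an 8192-byte page needs 16, and so on. A tiny sketch of marking one chunk dirty (constants assumed here, not pulled from the headers):

#include <stdio.h>

#define PAGE_SIZE 4096
#define DEV_BSIZE 512                 /* one bit per 512-byte chunk */

int
main(void)
{
    unsigned char dirty = 0;          /* 4096/512 = 8 chunks -> 8 bits */
    unsigned off = 1536;              /* byte offset within the page */

    dirty |= 1u << (off / DEV_BSIZE); /* mark chunk 3 dirty */
    printf("dirty bitmap: 0x%02x\n", dirty);
    return 0;
}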


Page fault reasons

#define PGEX_P          0x01     /* Protection violation vs. not present */
#define PGEX_W          0x02     /* during a Write cycle */
#define PGEX_U          0x04     /* access from User mode (UPL) */
#define PGEX_RSV     0x08     /* reserved PTE field is non-zero */
#define PGEX_I          0x10     /* during an instruction fetch */
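These bits arrive in the page-fault error code, and the fault handler tests them to classify the fault. An illustrative decoder (just the bit tests, not the kernel's handler):

#include <stdio.h>

#define PGEX_P   0x01   /* protection violation vs. not present */
#define PGEX_W   0x02   /* during a write cycle */
#define PGEX_U   0x04   /* access from user mode */
#define PGEX_I   0x10   /* during an instruction fetch */

static void
describe_fault(int code)
{
    printf("%s fault: %s access, %s mode%s\n",
        (code & PGEX_P) ? "protection" : "not-present",
        (code & PGEX_W) ? "write" : "read",
        (code & PGEX_U) ? "user" : "kernel",
        (code & PGEX_I) ? ", instruction fetch" : "");
}

int
main(void)
{
    describe_fault(PGEX_P | PGEX_W | PGEX_U);  /* user wrote a read-only page */
    return 0;
}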


Memory Allocators

uma.h - Universal memory allocator


Call graph mmap


mmap system call

Maps files or devices into memory
mmap() creates a new mapping in the virtual address space of the calling process. The starting address for the new mapping is specified in addr. The length argument specifies the length of the mapping.

struct mmap_args {
     void *addr;     /* address where the file segment should be placed */
     size_t len;
     int prot;       /* protection */
     int flags;
     int fd;
     long pad;
     off_t pos;      /* file offset: we can map just part of a file */
};
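These fields mirror the arguments of the userland call. A runnable example that maps the first page of a file read-only; passing addr = NULL lets the kernel pick the address (the fitit path in the walkthrough below). The file path is arbitrary; any readable file works:

#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int
main(void)
{
    int fd = open("/etc/hosts", O_RDONLY);
    if (fd < 0) {
        perror("open");
        return 1;
    }
    /* Map one page of the file, starting at offset 0. */
    char *p = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    write(STDOUT_FILENO, p, 20);   /* first 20 bytes of the file */
    munmap(p, 4096);
    close(fd);
    return 0;
}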

Function mmap(td, uap)
     struct thread *td;
     struct mmap_args *uap;
{

     addr = (vm_offset_t) uap->addr;
     size = uap->len;
     prot = uap->prot & VM_PROT_ALL;
     flags = uap->flags;
     pos = uap->pos;

     ...
     Sanity checking of inputs
     ...

     If we are mapping in uninitialized area
     ...

     If we are mapping in swap area
     ...

     If we are mapping file area
     get file pointer
     if ((error = fget(td, uap->fd, &fp)) != 0) {
          ...
     }

     ...
     ...

     Vnode associated with the file pointer
     vp = fp->f_vnode;

     Assign memory protection i.e. read, write, execute etc.
     if (vp->v_mount != NULL && vp->v_mount->mnt_flag & MNT_NOEXEC)
               maxprot = VM_PROT_NONE;
          else
               maxprot = VM_PROT_EXECUTE;
          if (fp->f_flag & FREAD) {
               maxprot |= VM_PROT_READ;
          } else if (prot & PROT_READ) {
               error = EACCES;
               goto done;
     }

     For shared mappings, grant write protection only if the file is open for writing
     if ((flags & MAP_SHARED) != 0) {
               if ((fp->f_flag & FWRITE) != 0) {
                    maxprot |= VM_PROT_WRITE;
                    ...
               }
          ...
     }

     handle = (void *)vp;
     handle_type = OBJT_VNODE;
     td->td_fpop = fp;

     error = vm_mmap(&vms->vm_map, &addr, size, prot, maxprot,
                        flags, handle_type, handle, pos);

     ...
     td->td_retval[0] = (register_t) (addr + pageoff);
     ...                       

}


mmap is just a wrapper around vm_mmap()

Function vm_mmap()
{
     ...
     Sanity Checking
     ...
     ...

     switch (handle_type) {
          case OBJT_DEVICE:
               ...
               ...
          case OBJT_VNODE:
               error = vm_mmap_vnode(td, size, prot, &maxprot, &flags,
                                          handle, foff, &object);
               break;
          case OBJT_SWAP:
               error = vm_mmap_shm(td, size, prot, &maxprot, &flags,
                                        handle, foff, &object);
    
               ....
     }

          Our case is OBJT_VNODE, vp = handle
          Function vm_mmap_vnode() {
               ...
               ...
               Get a ref on vnode
               vfslocked = VFS_LOCK_GIANT(mp);
               if ((error = vget(vp, LK_SHARED, td)) != 0) {
                    ..
               }

               obj = vp->v_object;
               type = OBJT_VNODE;
               handle = vp;
               ...
               ...

               Getattr for vnode
               error = VOP_GETATTR(vp, &va, cred)
               ...
               ...
               obj = vm_pager_allocate(type, handle, objsize, prot, foff);
                         Function vm_pager_allocate() {
                               This function allocates an instance of a pager of the given type
                              ...
                              ...
                              ops = pagertab[type];
                              /*
                                   struct pagerops *pagertab[] = {
                                        ...
                                        &vnodepagerops,          /* OBJT_VNODE */
                                        ...
                                   }

                                   struct pagerops vnodepagerops = {
                                        .pgo_alloc =     vnode_pager_alloc,
                                        ...
                                        ...
                                   }
                              */

                              if (ops)
                                   ret = (*ops->pgo_alloc) (handle, size, prot, off);

                                   //So we are going to call vnode_pager_alloc()
                                        Function vnode_pager_alloc()
                                        {
                                             //Allocates pager for a vnode
                                             vp = (struct vnode *) handle;

                                             ...

                                             object = vm_object_allocate(OBJT_VNODE, OFF_TO_IDX(round_page(size)));
                                             ...
                                             vref(vp);
                                        } //end of vnode_pager_alloc()
                              }
                         } // end of vm_pager_allocate()

               vput(vp);
               VFS_UNLOCK_GIANT(vfslocked);

          } //end vm_mmap_vnode()


     So far we have allocated the vm_object and taken a reference on the vnode. Now we need to
     allocate a vm_map_entry and link it into the appropriate data structures
     if (flags & MAP_STACK)
          rv = vm_map_stack(map, *addr, size, prot, maxprot,
              docow | MAP_STACK_GROWS_DOWN);
     else if (fitit)
          // Try to find it above the addr
          rv = vm_map_find(map, object, foff, addr, size,
              object != NULL && object->type == OBJT_DEVICE ?
              VMFS_ALIGNED_SPACE : VMFS_ANY_SPACE, prot, maxprot, docow);
     else
          rv = vm_map_fixed(map, object, foff, *addr, size,
                     prot, maxprot, docow);

     ...
     ...
     ...
}


vm_map_find finds an unallocated region of the given length in the target address map. The search is first-fit starting from the specified address; the address of the region found is returned through the same parameter.

Function vm_map_find(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
                          vm_offset_t *addr,     /* IN/OUT */
                          vm_size_t length, int find_space, vm_prot_t prot,
                          vm_prot_t max, int cow)
{
     Find space in a vm_map

     start = *addr;

     vm_map_findspace(map, start, length, addr)

               Find the first "fit" (lowest VM address) for "length" free bytes
               beginning at address >= start in the given map. This function returns
               the address (addr). This addr is not associated with vm_map entry.
              
               Function vm_map_findspace(vm_map_t map, vm_offset_t start, vm_size_t length,
                                               vm_offset_t *addr)
               {
                    We have min/max of VM addresses, request must be within that range

                    When the process address space is first set up,
                    there are no entries in its VM map
                    if (map->root == NULL) {
                         *addr = start;
                         goto found;
                    }

                    ...
                    ...
                    Find the place in the tree where our new entry can fit into the vm_map
                         while (entry != NULL) {
                              if (entry->left != NULL && entry->left->max_free >= length)
                                   entry = entry->left;
                              else if (entry->adj_free >= length) {
                                   *addr = entry->end;
                                   goto found;
                              } else
                                   entry = entry->right;
                         }
                         ...
                         ...
               } // end of vm_map_findspace()
     ...
     ...

     Insert into vm_map (vm_map.c)
     result = vm_map_insert(map, object, offset, start, start +
                                length, prot, max, cow);

     So, we have the pager and an address in the vm_map where we can insert a vm_map_entry,
     but we have not yet allocated the vm_map_entry itself. Function vm_map_insert()
     will create the vm_map_entry and insert it into the vm_map

          // Here start will have the addr value
          Function vm_map_insert(vm_map_t map, vm_object_t object, vm_ooffset_t offset,
           vm_offset_t start, vm_offset_t end, vm_prot_t prot, vm_prot_t max,
           int cow)
         {
              ...
              Sanity Checking
              ...

               Look up the entry in the tree. Since we already have the address, all we are doing
               here is finding where we sit relative to existing entries: are we ahead or behind?
             
              vm_map_lookup_entry(map, start, &temp_entry)

              ...
              ...

               If the request extends an existing object/vnode: for example, a malloc may require
               extending the initialized data area. Function vm_object_coalesce() will try to grow the object
              if (.... || vm_object_coalesce(...)) {
                   ...
              }

              //Create a new vm_map_entry
              new_entry = vm_map_entry_create(map);

              Initialize the new_entry
              new_entry->start = start;
               new_entry->end = end;

               new_entry->eflags = protoeflags;
               new_entry->object.vm_object = object;
               new_entry->offset = offset;

               //Since it's not a stack, avail_ssize is 0
               new_entry->avail_ssize = 0;

               new_entry->inheritance = VM_INHERIT_DEFAULT;
               new_entry->protection = prot;
               new_entry->max_protection = max;
               new_entry->wired_count = 0;

               //Insert the new vm_map_entry into the list
               vm_map_entry_link(map, prev_entry, new_entry);

               //Enter the mapping in the pmap, the machine-dependent part
               vm_map_pmap_enter(map, start, prot,object, OFF_TO_IDX(offset), end - start,
                                     cow & MAP_PREFAULT_PARTIAL);

         } //end vm_map_insert()

} //end vm_map_find


Page queues

One lock covers the active and inactive lists:
  • Active list
  • Inactive list

One lock covers the cache and free lists:
  • Cache list - pages move here from the inactive list
  • Free list - pages move here from the cache list





FreeBSD, Operating Systems
