CS 161: Operating Systems Spring 2017

Tuesday/Thursday 1:00-2:30

Pierce 301

James Mickens



CS161 Assignment 4: File Systems

In-class Design Review: Tuesday, April 11, 2017
Design Due: Thursday, April 13, 2017 at 5:00PM
Assignment Due: Wednesday, April 26, 2017 at 5:00PM (coursewide extension until Friday, April 28, 2017 at 5:00PM)

Objectives
Introduction

Up until this point, your OS/161 started anew every time it booted, blissfully unaware of any mayhem it may have unleashed on itself the last time it was running. Obviously, things in the real world aren't so clean. Operating systems must maintain state across boots using some kind of persistent storage. Often such storage is provided by disks, and the abstraction that operating systems usually provide for accessing a disk is a file system.

OS/161's native filesystem is called SFS (Simple File System). Though a relatively naive implementation, SFS boasts most of the basic features you would expect out of a filesystem:

All these and more are provided for you, free of charge, with the code you got at the beginning of the semester. So what's left for an ambitious CS161 student to do?

We're glad you asked! SFS has one glaring problem: it isn't resilient to crashes. All writes are buffered and are written to disk at an arbitrary time and in an arbitrary order. If OS/161 were to unexpectedly crash in the middle of operations (incredibly unlikely, we know, but bear with us), the data on disk could easily be left in an inconsistent state, rendering the filesystem unusable and with arbitrary amounts of data lost to eternity. That would be Bad™.

Your job is to modify SFS to be resilient in the face of crashes. This will entail two interrelated tasks:

We provide an on-disk journal implementation; you define the journal records to be used, the code that issues them, and the code that interprets them at recovery time and makes whatever changes are necessary to recover the operations described.

Additionally, as with anything you implement, you should test your system as thoroughly as possible. We would like you to turn in, as part of your final submission, whatever automated tests you develop to verify your system is working correctly.

(If you would like a refresher on file system consistency, Eddie Kohler gave a fabulous colloquium on this topic that can be viewed online here. We recommend reviewing the first part of this talk for a lucid discussion of file system crash recovery. However, we do not expect anyone to implement Featherstitch for this assignment!)

Setup

Before you begin, you need to do some merging. We are providing you with implementations of several more system calls; you will need to pull this code and merge it. We've pushed the changes to the class distribution repository on github on a branch called a4-syscalls. To grab them, do:

    git fetch handout
    git merge handout/a4-syscalls
(If you usually rebase, we recommend not doing so here.) The merge mostly adds new system call code in the file kern/syscall/more_syscalls.c, but it changes some other files as well. You will likely have merge conflicts. Sorting them out should, for the most part, be straightforward, but how much work it takes will depend on how extensively you've changed your code, and where.

The likely causes of merge conflicts:

Once you fix the conflicts, you can commit the merge. However, you're not done yet. If you look inside more_syscalls.c, you'll see the following functions:

These functions are completely implemented for you. But if you look at the bottom of the file, you'll see that the implementations of sys_getdirentry, sys_fstat, sys_fsync, and sys_ftruncate depend on the file table, and in particular they use the solution set's file table interface and not your own from assignment 2. Therefore, they won't compile yet; you will need to modify them to work with your file table.

This should be fairly straightforward to do. The file table interface they use is documented in a comment at the top of the file; your file table and open file code supports the same functionality and the same concepts, so you should be able to just replace the calls and structure fields used with your versions, or write fairly thin wrapper functions. If you don't want to deal with this up front it's fine to stub out these calls until you get far enough with the assignment that you need them working. (But do make sure this happens before you submit.)
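
If it helps to see the shape of such a wrapper, here is a purely illustrative sketch. Everything in it except curproc and struct vnode is a placeholder: myft_lookup, struct myfile, p_myft, and the field names stand in for whatever your assignment 2 code actually provides, and this is only one of several reasonable approaches.

    /*
     * Illustrative only: a thin shim that answers "which vnode is open
     * on this fd?" using *your* assignment 2 file table.  All my* names
     * are placeholders for your own types and functions.
     */
    static
    int
    fd_to_vnode(int fd, struct vnode **ret)
    {
            struct myfile *mf;      /* your open-file structure */
            int result;

            result = myft_lookup(curproc->p_myft, fd, &mf);
            if (result) {
                    return result;
            }
            *ret = mf->mf_vnode;    /* or whatever your field is called */
            return 0;
    }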

You shouldn't have to change the other functions in the file at all, as they don't involve file handles and use only VFS-level functions (and copyin/out functions) that you most likely haven't touched. However, you should take the time to make sure you understand how all the new functions work.

Now, config an ASST4 kernel, run bmake depend, and build it. Make sure it still boots.
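
(As a reminder, the usual sequence is the same as for previous assignments; starting from the top of your kernel source tree, and adjusting paths to match your setup, it looks like this.)

    cd kern/conf
    ./config ASST4
    cd ../compile/ASST4
    bmake depend
    bmake
    bmake install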

You are now ready to begin assignment 4. Commit everything and tag your repository asst4-start. Please tag only after committing the big merge, so that we don't see it as part of your diff.

Existing Infrastructure

Up until this point, you have not used SFS. Instead, you have been using emufs (the EMUlator File System). emufs amounts to little more than a pass-through to the underlying Linux filesystem on which sys161 runs. In fact, you probably noticed by now that all the user-level binaries get installed to the directory that you specified with the --ostree parameter when you ran configure. emufs simply accesses those files on the host system.

Beginning now, however, we will want to use a real filesystem. In our case, that filesystem is SFS. Fortunately for us, this switch will be transparent to the rest of the operating system because we are using the filesystem-independent VFS layer. The VFS layer takes high-level requests and dispatches them to the appropriate filesystem-specific code.

There are two main source directories that concern you:

You will want to begin by acquainting yourself with both sets of code.

Code Overview
Also take note of design/jphys.txt, which explains the physical journal code provided in sfs_jphys.c.

Code reading questions

  1. What happens (in broad terms) if sys_remove is called on a file that is currently open by another running process? Will a read on the file by the second process succeed? A write? Why or why not? (1 point)
  2. Describe the control flow, starting in the system call layer and proceeding through the VFS layer to reach SFS, that occurs for each of the following system calls. You need only trace the names of the functions that are called. Feel free to skip secondary or minor code paths that don't lead into SFS. (1 point)
    1. SYS_open
    2. SYS_write
    3. SYS_mkdir
  3. Now similarly describe the control flow within SFS for each of the same operations. Follow how we get to logic that makes specific updates to on-disk data structures (stopping where buffer-handling code takes over), and feel free to skip code paths that don't do so. (2 points)
  4. And finally, for mkdir, describe what happens to the block containing the newly allocated directory inode from the time the logic you've described in #3 ends to the time the inode eventually reaches the disk device. Trace the code that handles it, at a similar level of detail: function names and a very brief description of what happens at each point is enough. (2 points)
  5. Which of the following lock acquisitions are safe? (1 point)
    1. lock a vnode, lock the vnode table
    2. lock a vnode, lock the freemap
    3. lock the physical journal, get (lock) a buffer
    4. lock foo, lock foo/bar
    5. lock foo, lock foo/..
  6. Name two problems that the current implementation of sfsck can repair, and one that it cannot. (1 point)
  7. Describe four different types of inconsistent states the filesystem could end up in after a crash, and give a sequence of events that could lead to each. You will probably want to explain in your design doc how you handle these problems. (2 points)
Using SFS

We have prepared a separate handout with a primer on using SFS in OS/161.

Before you start coding, it's probably a good idea to create an SFS volume, mount it, and run some tests on it. Everything should work correctly provided you don't crash the system. If it doesn't work, it's better to hammer out the problems now than after you have the hood open and all the guts of SFS strewn around. The fs tests from the OS/161 boot menu should all run; so should the file- and directory-related tests from testbin/. This includes the stress tests like dirconc and psort.
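
(If you have never created an SFS volume before, the round trip looks roughly like this, run from your $(OSTREE) directory and assuming the default sys161.conf, in which lhd1: is backed by LHD1.img; see the SFS primer handout for the full story.)

    hostbin/host-mksfs LHD1.img testvol
    sys161 kernel "mount sfs lhd1:; unmount lhd1:; q"
    hostbin/host-sfsck LHD1.img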

Also note that a volume that is unmounted before shutdown should be reported clean by sfsck. We strongly recommend that you make sure that, as long as you don't crash the system, the file system tests run, a cleanly unmounted file system checks clean, and all is right with the world. If you do not do this, should you later encounter bugs, you won't know whether they were there before you started making modifications for this assignment.

The userland SFS utilities are compiled both for OS/161 and for the System/161 host OS (Linux in the appliance); the host versions can be run from the hostbin directory under $(OSTREE). This is often useful for scripting tests; it can also occasionally be useful for debugging them. If (read: most likely when) you need to adjust these tools, make sure both the host and native versions continue to work.

Making SFS Crash-Resilient

As we said before, SFS is not currently resilient to crashes. While it does employ fine-grained locking to enable safe concurrent operation, no care is taken regarding when, how, and where different structures are written to disk. Your job is to modify SFS so that it can reliably recover from a crash. Specifically, you will turn SFS into a journaling filesystem: changes to the filesystem are written to a journal using write-ahead logging (WAL) so that the filesystem can always be recovered to a consistent state following a crash.

As discussed in class, there are multiple ways to do journaling. You can journal whole blocks; that is, every time you change something you write the whole updated block involved to the journal. This is called "physical block journaling" or sometimes just "physical journaling" and it's how e.g. the Linux ext3 file system works. In some ways it is simple; but it is also not especially efficient and runs into problems if you ever want to journal anything that isn't just an updated copy of a disk block.

The alternative to journaling whole blocks is to write small journal records, which is known as "record journaling". Each record describes a single change; this avoids writing out a lot of redundant data and increases efficiency. Also, it's possible to issue records that describe multiple related changes at once, or things that aren't strictly speaking changes to on-disk structures like checksums of user data. For these reasons record journaling is generally preferred to block journaling.

The content actually appearing in journal records can span a wide continuum, from very high-level records describing high-level actions (e.g. "Remove file F, with inode number I and prior link count N, from directory D") to very low-level records describing low-level modifications (e.g. "change the word at offset 36 in block 1234 from 15 to 18"). These extremes are sometimes referred to as "logical record journaling" and "physical record journaling". Logical record journaling typically yields smaller journals and smaller I/O footprints; however, in practice it's substantially harder to get right and leads to substantially more complex recovery code. Because recovery code is difficult and hard to debug under the best of circumstances, pragmatic implementations tend to lean towards physical records.

In this assignment you will be implementing record journaling, and while you get to pick the exact set of log records you use, we encourage/expect/require you to stick to low-level physical records.

You will need to:

How you do this is entirely up to you, but there are a few ground rules:

Note that even when doing low-level record journaling, the journal records can be designed at a wide variety of levels of abstraction, ranging from low-level changes to specific bytes or integers (e.g., "change bit 546 of block 4 from 0 to 1") to logical updates to particular data structures (e.g., "allocate inode 546"). What records you have, what information they contain, and how they're arranged for recovery is yours to design. (If you find yourself wanting to issue single records for whole file system operations, like "create a directory", you are heading into logical record territory and we advise a different course.)
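
For concreteness, here is a minimal sketch of what a couple of low-level record types might look like. The names, fields, and encodings are entirely illustrative; nothing below is part of the provided code, and your own record set will almost certainly differ.

    #include <types.h>      /* kernel fixed-width integer types */

    /* Illustrative record type codes. */
    #define MYSFS_REC_FREEMAP_CHANGE  1   /* one freemap bit flipped */
    #define MYSFS_REC_META_UPDATE     2   /* small byte range in a metadata block changed */

    /* "Block B is now allocated (or free)." */
    struct myrec_freemap_change {
            uint32_t block;         /* block number whose freemap bit changed */
            uint8_t  newstate;      /* 1 = now allocated, 0 = now free */
    };

    /* "Bytes [offset, offset+len) of block B changed." */
    struct myrec_meta_update {
            uint32_t block;         /* disk block being modified */
            uint16_t offset;        /* byte offset within the block */
            uint16_t len;           /* number of bytes changed */
            /* the old and new bytes follow this header, so that recovery
               can either redo or undo the change */
    };
Recording both the old and the new contents, as in the second record, keeps both redo and undo available at recovery time; which of the two you actually need depends on the rest of your design.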

As implied by the rules above, you do not need to journal user data. This is fairly typical behavior for a journaling file system.

We provide you with code for an on-disk journal that manages records as arbitrary blobs. You may change this code in the course of doing the assignment; but ideally you will not need to. (If you find that you do want to change the on-disk journal code, please contact the course staff; it probably means either we didn't explain something adequately or it means something about the code should be improved for next time, and either way we'd like to know about it.) Your code will generate those blobs and handle them at recovery time, and take care of knowing what types of records exist and what they mean.

Interfacing to the physical journal

The physical journal code is documented in design/jphys.txt. We aren't going to repeat that material here, so please read the file, and if you have questions please ask.

That said, the interface to the physical journal is divided into roughly three parts:

Code to call the startup and shutdown functions has already been added to SFS for you; you will find this code (and the places you should insert calls to your own recovery routines) in the mount and unmount logic in sfs_fsops.c. Your recovery code will use the reading interface to scan the journal. You can scan forwards or backwards, and as many times as you wish, as needed by your recovery plan.
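
In rough strokes, the recovery side often boils down to a dispatch over record types. The sketch below reuses the illustrative MYSFS_REC_* codes from earlier and leaves the actual iteration to the jphys reading interface; it is a shape, not an implementation.

    /*
     * Sketch only: handle one decoded journal record during recovery.
     * How you obtain each (type, rec, len) tuple is up to the reading
     * interface described in design/jphys.txt.
     */
    static
    void
    my_recover_record(struct sfs_fs *sfs, unsigned type,
                      const void *rec, size_t len)
    {
            switch (type) {
                case MYSFS_REC_FREEMAP_CHANGE:
                    /* load the freemap block and make the bit match the
                       record (redo) or its prior state (undo) */
                    break;
                case MYSFS_REC_META_UPDATE:
                    /* load the target block and copy the appropriate
                       logged bytes over [offset, offset+len) */
                    break;
                default:
                    panic("recovery: unknown record type %u\n", type);
            }
            (void)sfs;
            (void)rec;
            (void)len;
    }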

The writing interface can also be subdivided as follows:

You will be writing the code for adding records to the journal, and you also need to write code that requests flushing the journal to enforce write-ahead logging. You must also use flush and trim to provide checkpointing, so that the journal does not grow without bound.
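
The ordering here is the part that matters. A checkpoint, in rough strokes and with placeholder my_* helper names (the real flush and trim entry points belong to the provided jphys interface; see design/jphys.txt), might look like this:

    /*
     * Rough shape of a checkpoint.  All my_* helpers are placeholders
     * for code and bookkeeping you write yourself.
     */
    static
    void
    my_checkpoint(struct sfs_fs *sfs)
    {
            sfs_lsn_t newtail;   /* or whatever LSN type the journal code uses */

            /* Write out dirty metadata so that records before the new
               tail are no longer needed for redo.  (The write-ahead
               rule, journal record on disk before the block it
               describes, is enforced separately, when blocks are
               written out.) */
            my_sync_old_metadata(sfs);

            /* Choose the new tail: no surviving operation may still
               need a record before it, e.g. for undo. */
            newtail = my_oldest_needed_lsn(sfs);

            /* Make the journal durable, then discard everything before
               the new tail. */
            my_flush_journal(sfs);            /* wraps the jphys flush call */
            my_trim_journal(sfs, newtail);    /* wraps the jphys trim call */
    }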

You will find that two hooks have already been added to sfs_writeblock (which is the interface used by the buffer cache to write out dirty filesystem blocks) to make sure that the journal itself gets written out in order and to keep the journal code informed of which journal blocks have been written out. Unless we have blundered badly somewhere you should not need to change those hooks or add more calls to the functions they use.

You will find that sfs_jphys_write accepts a callback function. This allows you to update data structures using the LSN of a new record before the journal lock inside jphys is released. The reason for this is discussed in jphys.txt (and will be talked about in section). You may not need this hook, but be aware that it exists and why it's provided, and note that if you use it, it creates an ordering requirement between the journal lock and any locks you take in the callback.

Limits of the physical journal code

The journal code has some limits that you should be aware of:

Also, there is currently no logic in the journal code to detect if the on-disk journal head (the end new records are written to) runs into the on-disk journal tail (the oldest record still alive and not trimmed away). This is a bug, or at least a fairly significant missing feature. Without such logic, if this happens you will know at recovery time because the physical journal code won't load the journal and will complain about finding no trim record or not being able to find the tail. As long as you are making regular checkpoints this should not be a serious problem.

Some things to know when debugging

Some useful tidbits:

Testing (or: The Doom Counter and You)

Testing recovery code is difficult, in part, because it entails being able to anticipate everything that could ever possibly go wrong, which is next to impossible. To help you, we have helpfully added a feature to System/161 we like to call the "Doom Counter." Every time a write is started on a disk device, the counter is decremented. If the value of the counter ever falls to 0, the machine trips over its power cord and pulls it out of the wall, crashing your system. Thus you have a convenient way to purposefully crash your system at arbitrary (but interesting) points in time. The -D option to sys161 allows you to set the Doom Counter; be sure to test with lots of different values.
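
For example (the count here is arbitrary; test with many different values):

    sys161 -D 5000 kernel
    (sys161 cuts the power after 5000 disk writes)
    sys161 kernel
    (mounting the volume again should trigger your recovery code)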

Using frack

The frack test is a tool for playing back simple test workloads. It is specifically intended for testing recovery. The basic model is as follows:

You run it as follows, using the doom counter to trigger crashes:

   frack do workload
   (crash, reboot, run recovery)
   frack check workload
It has several dozen workloads coded into it; you can run frack list to get the list. Currently it is underdocumented and the best way to see what the workloads are is to read the source.

Note that frack is complicated and may still have some problems. There's a known issue that it leaks file handles in some circumstances; if you find out where, please let us know. It is also still somewhat too aggressive about demanding that user data isn't lost.

Using poisondisk

The poisondisk program clears an entire disk to a recognizable (non-zero) byte pattern. The idea is that you can run this, then run mksfs to create a file system, then run a test workload. If the poison values show up in files (or worse, in metadata) after testing, you know you have a problem with exposing uninitialized disk blocks.

You can run poisondisk from inside OS/161 as testbin/poisondisk or outside as hostbin/host-poisondisk.

The poison value (as a byte) is 0xa9, or 169; full words are 0xa9a9a9a9, or 2846468521 (as unsigned decimal) or -1448498775 (as signed decimal). In iso-latin-1 this character prints as a copyright symbol; in UTF-8 a sequence of 0xa9 bytes is an invalid multibyte character.

Miscellaneous Helpful Hints
What to hand in

Design Document

As with the previous assignments, we will be conducting peer design reviews in class, on Tuesday, April 11. You should have a complete first revision of your design document ready by then. Bring two hard copies of that design to class and commit it to your team repository (in the directory os161/submit/asst4). After the peer review, make any changes to your design document, note what you have changed, and then push again to your repository.

By 5:00 pm on Thursday, April 13, please submit your design document by putting it in the submit/asst4 subdirectory of your OS/161 repository as design_document.{pdf,md,txt}. Include solutions to the code reading questions in code_reading.txt in the same directory. Please commit and push to github with a message that clearly identifies the submission, and verify that the submission is visible on the grading server.

Your design document submission must include your plan for converting SFS into a crash-resilient journaling filesystem. Some things you should consider for your design document include:

As usual, your design document is worth 30% of your assignment grade.

Implementation

Your final submission, due at 5:00 pm on April 28, should include (along with your code) your revised design document and your sys161.conf file. Place all non-code and config files in the directory submit/asst4 under the top level of the OS/161 source.

You should submit by tagging your repo with asst4-submit, and then pushing your final working copy of your source along with your tags to your team git repository. As always, if you are taking late days, please be in contact with your TF. Also, if you have to retag your submission or do anything else with git such that git-level slipups might cause your TF to grade the wrong version, please send your TF the commit hash of the correct version as a precaution. Refer to the previous assignment specifications if you need a refresher on how to submit.

You must verify your intended submission is displayed on the grading server; we will not accept submissions not visible on the grading server.