Computer Science 161 (2014)
Final Exam
24 hours between May 6, 2014 and May 10, 2014 PM
240 minutes/240 points
Instructions
The points allocated to each problem correspond to how much time
we think the problem should take. Although you have 24 hours in
which to take the exam, we do not recommend that you pull an
all-nighter and spend 24 hours completing the exam.
All questions should be directed to me (Margo) via email. I will
check mail regularly, but I will not be watching constantly, so do not
expect an instant response. I strongly encourage you to read through
the exam and fire off any questions that immediately come to mind.
When in doubt, document your assumptions.
Here is my advice on how to complete this exam:
- Read through the entire exam. Send me any questions.
- Go to lunch or dinner or take a walk and think about the exam.
- Spend a couple of hours on the first day/evening writing up answers to the
questions that you know. Sketch out your thoughts about some of the longer
design problems.
- Get a good night's sleep.
- Spend another couple of hours on day 2 working on the remaining
problems and fleshing out the details of your designs.
- Read over your answers.
- Email them to me at margo@eecs.harvard.edu.
- On Monday, May 12, show up to the 2nd floor Lobby of Maxwell Dworkin
to consume waffles and ice cream, pick up your T-shirt,
and optionally iron on a team patch (if you have fancier animal artwork
than my scanned images, please send them to me before the 12th and I'll
have them available for you to iron on).
Please send your answers to me in a PLAIN TEXT file as an attachment
(in a perfect world, your text file would not have lines longer than
80 characters, but I'll reformat it if it does).
Label the answers clearly so it is
obvious what answers correspond to what questions. Leave space
between each question. Please place your answers in order in the
file. Send your completed exam to margo@eecs.harvard.edu.
I will return the graded exam to you; I tend to engage in a dialog as
much as possible, so there is some educational value in reading more
than just the final number.
This exam is open textbook, open source code, and open notes. You
may not use computers (cell phones, PDAs, or any other device that can
talk to a network)
during the exam (other than to record your answers, to view OS/161
or System/161 code, or to examine your notes or the course web site).
Please do not access the Piazza group during the exam.
You must complete the exam on your own.
Exam Policy Statement
Please do not forget to do this!
As stated above, please write the following statement and
then place your name after it, asserting that you have followed the
rules of the exam: "I have neither given nor received help on this exam
and have followed the rules as stated."
If this statement is absent, you will hear from me.
1. Retake Concept Inventory (5 points)
This is a free 5 points -- all you need to do is complete the
concept inventory that you took at the beginning of the year to the best
of your ability.
You will be graded for completion, not based upon the correctness of your answers,
but if you do worse than you did in February, we'll be sad.
2. OS Abstractions (45 points)
The main learning objective of this course is for you to be able to explain how
an operating system provides the abstractions with which programmers are
familiar.
To that end, for each of the three items listed below, explain how the operating
system provides the abstraction to a smart student who has completed CS50, is
reasonably comfortable programming to the POSIX API, but has not taken CS61
(or CS161).
(More concretely, your audience knows how to program, has heard of VM but has
no idea what it means, knows that computers consist of a processor, memory,
and a disk, and knows how to write programs in C that use POSIX system calls
like fork, exec, open, close, read, write.)
Please limit your answer to roughly 100-200 words (I won't count, but I want more
than 3 sentences, and less than a page).
- Processes
A process is the abstraction used to make your program run on a computer.
You can think of a process as consisting of a big box, which we call the
address space. The box contains all the memory (data) that your program
can touch; it also includes the program itself. In addition to your box,
you get some active little thing that moves around inside the box executing
instructions -- think of it as a really well trained insect -- it knows how
to read an instruction and then do what the instruction says, but it can't
get out of the box. When the program needs to do something that the insect isn't
allowed to do (e.g., read from a disk), it makes a really loud noise (a system
call) that tells you (the operating system) that there is something to do.
You look in the box and are given information describing what the insect
wants you to do and where to get any additional information and then you do
it for the little insect.
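To tie the analogy back to code the student already knows, here is a tiny POSIX example (plain C, nothing course-specific); each commented call is one of those "loud noises" that traps into the kernel:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    /* fork() traps into the kernel, which builds a second box (address
     * space) and a second insect (thread of execution) to run in it. */
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        /* Child: exec asks the kernel to refill this box with a
         * different program. */
        execl("/bin/echo", "echo", "hello from the child", (char *)NULL);
        perror("execl");
        exit(1);
    }
    /* Parent: another trap; the kernel tells us when the child's box
     * has been torn down. */
    waitpid(pid, NULL, 0);
    return 0;
}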
- Virtual Memory
Let's continue with our box analogy. Remember in Harry Potter how they could
have tents that looked normal size on the outside but were ginormous on the
inside? This is kind of how your box works. On the inside it looks like you
have a huge amount of memory at your disposal (usually about 2 or 4 GB;
sometimes a lot more). However, your machine might not have that much memory.
VM is what makes that illusion work. In hardware there is some mechanism or
mechanisms (TLBs and page tables) and on every instruction they translate an
address in your huge box to a real address in the memory. The OS does a quick
sleight of hand -- if the box needs data and the real memory is full, it picks
something to take out of real memory (and leaves a note for itself) and then
finds the item needed and puts it in the real memory.
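To make the translation step concrete, here is a toy sketch. The 4 KB page size and the tiny flat page table are made up for illustration; real hardware uses a TLB and multi-level page tables, but the arithmetic is the same:
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

/* Toy flat page table: virtual page number -> physical page number. */
static uint32_t page_table[8] = { 3, 7, 1, 0, 5, 2, 6, 4 };

static uint32_t
translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;  /* which page of the big box    */
    uint32_t offset = vaddr % PAGE_SIZE;  /* where within that page       */
    uint32_t ppn    = page_table[vpn];    /* where that page really lives */
    return ppn * PAGE_SIZE + offset;
}

int
main(void)
{
    /* Virtual address 0x1234 is in virtual page 1; page 1 maps to
     * physical page 7, so the real address is 0x7234. */
    printf("0x1234 -> 0x%x\n", translate(0x1234));
    return 0;
}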
- Files
Computers store data on disks in blocks, which are typically chunks of about 4 KB.
The operating system builds an index structure for your file that provides the
place on disk where each part of your file lives. For example, if your file is
12 KB, then this index structure will say something like, "block 0 is at address A,
block 1 is at address B, and block 2 is at address C." Directories work kind
of the same way -- the directory is stored as a file, but its contents are structured
in a specific manner. You can think of a directory as containing an array of
structures, where each structure contains a number that uniquely identifies a file
and the name (relative to the directory) corresponding to that file.
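For a concrete (and deliberately simplified) picture of that index structure and that "array of structures", something like the following; the names, types, and sizes are made up for illustration, but they mirror the information a real on-disk format carries:
#include <stdint.h>

#define MAX_NAME_LEN 255        /* assumed limit, for illustration */

/* One slot in a directory: a number that uniquely identifies a file
 * (an inode number) plus that file's name within this directory. */
struct dir_entry {
    uint32_t inode_number;
    char     name[MAX_NAME_LEN + 1];
};

/* The index structure for the 12 KB file example: with 4 KB blocks it
 * conceptually records three disk addresses -- block 0 at A, block 1
 * at B, block 2 at C. */
struct file_index {
    uint32_t block_addr[3];
};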
3. Pot Pourri (35 points)
- (15 points) We know that operating systems enforce security on files by means of
access control lists and capabilities, but operating systems must be
security conscious about other things as well.
Think about the three A's, and list 3 things an operating system does to
provide security. For each item, if possible, describe which of the A's it
addresses.
Here are five for fun.
- Virtual memory is a security mechanism; it prevents one process from
reading another's address space. A virtual address is like a capability
for a memory location, so that's an authorization mechanism. The access
enforcement is both in the HW and in the VM system that manages page
tables and TLBs. The authentication mechanism kicks in when a process
is created and given the credentials associated with a particular user.
- Login: Forcing users to log in lets you associate an identity with
each session/process, etc. The login mechanism itself is an authentication
action.
- Argument checking of system calls: we copy user process data into the
kernel and do some amount of error checking (a small sketch of this idea
appears after this list). This is really a kind of
access enforcement in that we are trying to prevent user processes from
handing data to the OS that would cause it to crash or, worse yet, leak
information.
- Process scheduling: this is a kind of DoS avoidance mechanism. It
prevents a process from taking over the whole system -- I guess that
makes it a kind of access enforcement, but I don't think it fits neatly
in any category.
- Disallowing user processes direct access to I/O devices. This is a great
example of access enforcement: allowing direct access would bypass access
control, letting users do things for which they don't have the authority.
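Here is a rough sketch of the argument-checking idea from the third item above. The boundary constant and function names are made up for illustration; OS/161's real copyin/copyout do this more carefully (and survive page faults on the user address):
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define USERSPACE_TOP 0x80000000u   /* assumed user/kernel boundary */

/* Reject user pointers that wrap around or reach into kernel space. */
static bool
user_range_ok(const void *uptr, size_t len)
{
    uintptr_t start = (uintptr_t)uptr;
    return start + len >= start && start + len <= USERSPACE_TOP;
}

/* Copy a user buffer into the kernel before acting on it, so a bad
 * pointer can't make the kernel read or write memory it shouldn't. */
static int
copy_in_sketch(void *kbuf, const void *ubuf, size_t len)
{
    if (!user_range_ok(ubuf, len)) {
        return -1;              /* EFAULT in a real kernel */
    }
    memcpy(kbuf, ubuf, len);
    return 0;
}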
- (20 points) Discuss how virtualization and time-sharing are similar/different.
Again, one paragraph each for similarities and differences should be sufficient.
Similarities:
- Both are about equitably sharing resources among different agents, where
agent = user in a time shared system and agent=VM in a virtualized
environment.
- Both want to provide as much isolation as possible, performance isolation
as well as protection
- Time sharing systems can deschedule user processes in the same way that
a VMM can deschedule a virtual machine.
Differences:
- Virtualization adds a layer of indirection that time-sharing didn't have
and thus makes some things more difficult (e.g., the need for
a balloon driver instead of directly managing memory).
- The operating system typically has some notion of priorities
or fairness that it uses to, for example, give a benefit to I/O intensive
jobs. In the VM environment, we tend to simply design for performance isolation.
4. File System Design: Inode location management (15 points)
LFS keeps track of the locations of inodes by storing a large array
in the ifile. The inode number is used as an index into
this array, which contains the disk address of the corresponding inode.
In contrast, ZFS stores all dnodes directly in an object set (objset). Each
dnode lives in a precise location in the objset (e.g., dnode N lives
at offset sizeof(dnode) * N). So updating a dnode is just like updating
any other part of an object or objset.
Both systems use a no-overwrite storage mechanism, so inodes and
dnodes move around the disk. Explain in one to two paragraphs
how these two approaches
to tracking the locations of file metadata are similar and dissimilar.
Think about how many I/Os are required to update files in each system.
These two approaches actually sound very similar to me. In LFS, we use a file, while
ZFS uses an objset. An objset is simply a more general abstraction than a file. However,
updates are handled similarly. In LFS, when you update an inode, you write a new
copy of that inode and simply change its disk address in the ifile, which gets written
to disk to be looked up later. In ZFS, when you update a dnode, that changes the address
in a pointer in the dnode of the objset. If we think about the number of I/Os, in LFS you get:
- 1 I/O for the updated inode
- 1 I/O for the corresponding block in the ifile
- 1 I/O for the ifile inode
- 1 I/O for the superblock that tells you where the ifile inode lives
In ZFS you get:
- 1 I/O for the updated dnode
- 1 I/O for the dnode of the objset containing a file system's dnode
- 1 I/O for the pointer to the dnode of the objset.
So, ZFS uses one fewer I/O. BUT ... because an update to a dnode is "in place", you
will write out many dnodes that didn't change (i.e., more than 1 dnode fits in a block).
In LFS, you repack inodes together, so that if you modify inodes 5, 500, and 5000, you can
put them all in the same block. In ZFS, dnodes 5, 6, 7, etc. will always be in the
same block, while dnodes 500 and 5000 will be in different blocks. So, ZFS may end up
writing more total blocks.
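To make the packing arithmetic concrete, here is a tiny sketch (the block and dnode sizes are assumptions for illustration, not real ZFS values):
#include <stdio.h>

#define BLOCKSIZE  4096
#define DNODE_SIZE 512

int
main(void)
{
    int dnodes[] = { 5, 500, 5000 };
    for (int i = 0; i < 3; i++) {
        int n = dnodes[i];
        /* In the fixed-layout scheme, dnode n lives at offset
         * n * DNODE_SIZE, so which block it sits in is fixed by n. */
        printf("dnode %d -> objset block %d\n", n, n * DNODE_SIZE / BLOCKSIZE);
    }
    /* Prints blocks 0, 62, and 625 -- three different blocks -- whereas
     * LFS could repack those three dirty inodes into a single block. */
    return 0;
}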
At the end of the day, these approaches seem quite similar to me.
5. File System Design: Journaling user data
This is one of those long design questions. Read it through first,
but then move on to some of the short answer questions, before
diving into this one.
Today's modern file systems, like ZFS, support many features, such as:
- Deduplication: finding blocks containing identical data,
storing only one copy of such blocks, and then letting multiple
files reference them.
- Snapshots: read-only versions of a file system as of a point
in time.
- Clones: These begin as a snapshot, but allow subsequent updates.
These features result in data blocks being referenced from multiple
places (e.g., two different files; two different snapshots).
Your job is to design a scheme for efficiently journaling user data in such a
system. You should take advantage of the fact that blocks can be referenced from
multiple locations.
You should assume the following about your system (many of these ideas are taken
from ZFS):
- Meta-data updates are journaled. That means that you have a journaling and
recovery mechanism, so you only need to talk about adding new records and possibly
changing how recovery might work.
- Your system takes periodic checkpoints (on the order of every 5-10 seconds),
and you are guaranteed that the disk is 100% consistent after every checkpoint.
- It is OK to delay all file system operations, except those that explicitly
require synchronous writes (e.g., sync, fsync). However, it is not OK to lose
all updates between checkpoints, so your log does need to be written more
frequently than every checkpoint; it just need not be flushed on every transaction.
Part I (50 points)
In this part, you will explain your mechanism for journaling user data.
The key design constraint that you must satisfy is that file data may
only be written to persistent storage once.
You should include a description of any log record(s) you wish to use.
You should explain when these log records are created and when they must
be written to persistent storage.
You should also explain if the addition of these log records introduces
any dependencies in your buffer management.
Finally, discuss the advantages and disadvantages of your approach.
We suggest that you begin approaching this problem by figuring out everything
that you need to do to handle a write system call.
Then consider how you will handle the following two scenarios (where BLOCKSIZE is
the size of a block in the file system):
Scenario 1
char buf[BLOCKSIZE];
write(fd, buf, BLOCKSIZE);
Scenario 2
char buf[BLOCKSIZE];
write(fd, buf, BLOCKSIZE/2);
write(fd, buf+BLOCKSIZE/2, BLOCKSIZE/2);
Next start figuring out log records you will need and any
dependencies in your buffer cache.
Write that all up and then step back and assess it critically
to discuss the advantages and disadvantages.
The key insight is that we can write the data once and then reference it both
from the log and from a file. This introduces the following tricky parts:
- we will need to write the data to disk before we write the log record.
This is TOTALLY backwards from practically everything we do. The only reason
it is OK is because new data is always written to new blocks, so the only
bad outcome from this is that we write data, no log record is ever written
(due to a crash), and those blocks continue to be treated
as free blocks. This wastes disk bandwidth, but does not
damage the file system's integrity.
- there are multiple steps involved in writing and logging data
- free space management just got a bit more complicated.
When an application writes data to a file, we need to do the following:
- (i) Find a free block into which to write the new data.
- (ii) Log the block allocation.
- (iii) Write the data.
- (iv) Write a log record referencing the newly written data block.
- (v) Log the change in the dnode.
- (vi) Flush the log at commit.
As discussed above, we need to make sure that we flush the data (step iii) before we
let record iv go to disk (otherwise we have a log entry referencing data that is not
yet written). Fortunately, we don't really have to flush the log on every commit; we
can batch commits together and push them to disk once we've accumulated a lot of
data. In the meantime, we might be able to write blocks that were written in their
entirety to disk.
So, our new log record looks something like:
txnid = TXNID
type = data write
file = identify the file (dnode)
block = block number
daddr = block pointer referencing the newly written block
chksum = a checksum: not strictly necessary, but if only one copy of
the block is successfully written, perhaps we can use this to
figure out which of the multiple copies is correct and recover
partially written data? That seems like a pretty advanced
feature, so I'm not going to discuss it further, but it's a neat
way to avoid having to write all N copies of a block.
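In C terms, that record might look roughly like this (everything here -- names, types, the constant -- is illustrative, not a real on-disk log format):
#include <stdint.h>

#define LOG_DATA_WRITE 7   /* made-up record-type constant */

/* Sketch of the "data write" log record described above. */
struct data_write_record {
    uint64_t txnid;       /* transaction that performed the write      */
    uint16_t type;        /* LOG_DATA_WRITE                            */
    uint64_t dnode_id;    /* identifies the file (its dnode)           */
    uint64_t blockno;     /* which block of the file was written       */
    uint64_t daddr;       /* block pointer to the newly written block  */
    uint32_t chksum;      /* optional checksum of the block contents   */
};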
When we create that log record, we also make it dependent upon the data block -- we must
not flush the log until that data block is safely on disk.
A huge disadvantage of this scheme is that instead of all log writes being
sequential, we have these data blocks that are strewn hither and yon around the
storage system. In order to minimize the impact, I will asynchronously write blocks
when we "fill" them. I'll call a block "filled" when we write the last byte in it.
So, if you write sequentially (as all good programs do), then when you finish
writing a block, we'll write it asynchronously. If you write a partial block, we won't
immediately write it, but will keep it cached until either A) you finish writing it
or B) we have to flush it so we can write a log record.
OK, so now our write algorithm looks like the following (a code sketch follows the list):
- Find a block to allocate
- Log the allocation
- Log the dnode (or indirect block) update to reference the newly allocated block
- Update the buffer in memory
- If we "filled" the block, schedule it for write.
- Log the write record described above
- Commit
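Here is the same write path as a sketch in code. Every helper it calls is hypothetical, declared only so the sketch reads as C; the point is the ordering and the dependency, not the interfaces:
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 4096

struct file;
struct logrec;
struct buf { uint64_t daddr; char data[BLOCKSIZE]; };

/* Hypothetical helpers, assumed for this sketch only. */
uint64_t       balloc(void);
struct buf    *getblk(struct file *f, uint64_t blockno);
struct logrec *alloc_record(uint64_t daddr);
struct logrec *dnode_update_record(struct file *f, uint64_t blockno,
                                   uint64_t daddr);
struct logrec *data_write_record(struct file *f, uint64_t blockno,
                                 uint64_t daddr);
void           log_append(struct logrec *r);
void           log_append_with_dependency(struct logrec *r, struct buf *bp);
void           schedule_async_write(struct buf *bp);
void           log_commit(void);

void
file_write(struct file *f, uint64_t blockno,
           const void *data, size_t len, size_t off)
{
    uint64_t daddr = balloc();                           /* allocate a block  */
    log_append(alloc_record(daddr));                     /* log allocation    */
    log_append(dnode_update_record(f, blockno, daddr));  /* log dnode update  */

    struct buf *bp = getblk(f, blockno);                 /* update the buffer */
    memcpy(bp->data + off, data, len);
    bp->daddr = daddr;

    if (off + len == BLOCKSIZE)                          /* block "filled"?   */
        schedule_async_write(bp);

    /* The log must not be flushed past this record until bp is on disk,
     * so the dependency is attached to the record. */
    log_append_with_dependency(data_write_record(f, blockno, daddr), bp);

    log_commit();                                        /* commit, no flush  */
}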
And our log flush algorithm looks like this:
- For any write records in the current batch of committing transactions,
verify that the buffer has been written to disk; if it has not, write it now.
- Flush the log
We can test the first condition by simply retaining an in-memory list of blocks on
which chunks of the log depend (by a chunk of log, I mean a decent-sized unit of log
that I write).
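And the flush side, under the same hypothetical helpers as the write sketch above, where deps is that in-memory list of blocks this chunk of the log depends on:
/* Sketch of flushing one chunk of the log; helpers are assumed. */
struct buf;
int  buf_is_on_disk(struct buf *bp);
void write_buf_now(struct buf *bp);   /* synchronous write of the data block */
void write_log_chunk(void);           /* the actual log write                */

void
log_flush_chunk(struct buf **deps, int ndeps)
{
    /* Any block that a committing write record points at must reach
     * disk before the record does; write the stragglers now. */
    for (int i = 0; i < ndeps; i++) {
        if (!buf_is_on_disk(deps[i]))
            write_buf_now(deps[i]);
    }
    write_log_chunk();
}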
Advantages
- We now have user-data journaling (yeay!)
- We don't write the data twice (yeay!)
Disadvantages
- We can clutter up our I/O system with non-sequential writes as part of
flushing the log.
- If we have random access writes to a block and frequently write the last
byte of the block, we might write the block multiple times (although we
might be able to detect this).
- We might end up slowing log writes because of the data writes.
Part II (10 points)
Now consider any modifications that you might need to make to the
system's free space management.
Our free space management already has to handle the case that a block
can be referenced from multiple places, so that's a big win. The only
things we really need to figure out are:
- (1) When does the log no longer need to retain its reference to a data block?
- (2) Where do we store that information?
Item 1 is the big deal: the log is only required to keep track of things
between checkpoints, so we know that the log can relinquish its reference
to a block as soon as the next checkpoint occurs. That's great news! It
means that all we need to ensure is that we don't fully reclaim any disk
block until the checkpoint following its allocation -- that doesn't sound
tough, does it?
Let's think about deallocation, because that's where this comes up.
So let's assume that we're doing a truncate (or a delete). There
are two cases to consider:
The block was allocated in a previous checkpoint: no problem, we don't need
the log record any more, so we can ignore its reference.
The block was allocated in the current checkpoint: the deallocation in and
of itself isn't a huge problem, but if we were to reallocate it again, then
we could create the following situation:
txn 1: Allocate and write block B
...
txn 2: Free block B
txn 3: Allocate and write block B
flush the block to disk
crash...
Let's assume that B did get written to disk in txn 3, and furthermore, let's assume
that the log record for txn 2 didn't make it to disk. When we try to recover
the log, our original write had better still be valid. So, this suggests that we
cannot reallocate blocks allocated in the current checkpoint.
That doesn't sound so hard -- we already need to keep track of newly allocated blocks
until the log flushes (as mentioned above), so we can keep tracking them until the
end of the checkpoint and simply delay deallocations of those blocks until the
next checkpoint (a small sketch of this bookkeeping follows).
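A small sketch of that bookkeeping, with a made-up set-of-blocks interface just to show the shape of it:
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical set-of-blocks interface, assumed for illustration. */
struct blockset;
bool blockset_contains(struct blockset *s, uint64_t daddr);
void blockset_add(struct blockset *s, uint64_t daddr);
void blockset_apply(struct blockset *s, void (*fn)(uint64_t));
void blockset_clear(struct blockset *s);
void really_free_block(uint64_t daddr);

struct blockset *allocated_this_ckpt;   /* blocks allocated since the last checkpoint */
struct blockset *deferred_frees;        /* frees we are holding back                  */

void
free_block(uint64_t daddr)
{
    if (blockset_contains(allocated_this_ckpt, daddr)) {
        /* Allocated in the current checkpoint: don't let it be
         * reallocated until the checkpoint commits. */
        blockset_add(deferred_frees, daddr);
    } else {
        really_free_block(daddr);
    }
}

void
checkpoint_done(void)
{
    /* The disk is consistent again, so the held-back frees are safe. */
    blockset_apply(deferred_frees, really_free_block);
    blockset_clear(deferred_frees);
    blockset_clear(allocated_this_ckpt);
}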
6. Designing for Byte-Addressable Persistent Memory (80 points)
For this question, you will need to read
this paper
from the beginning up until section 4.1 (You will want to read about PCM, but you can
stop after that.)
The paper presents a way to build a file system on
byte-addressable persistent memory.
(You are, of course, free to read the entire
paper, but I wanted to keep the portion necessary for the exam
within reason.)
You will undoubtedly find it useful to read the questions
before you start reading.
- In your own words, describe short-circuit shadow paging.
Short-circuit shadow paging is a way of stopping copy-on-write behavior
before it propagates all the way up to the root of the file system.
Basically, you CoW until the point where you can effect the change with a
single atomic write.
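A rough sketch of that "single atomic write", assuming a hypothetical tree node made of 8-byte child pointers in persistent memory (the paper's real commit also involves epoch barriers to order the flushes; this only shows the shape of the idea):
#include <stdint.h>
#include <string.h>

#define NCHILD 512

/* Hypothetical tree node living in byte-addressable persistent memory.
 * An aligned 8-byte store is atomic on the hardware the paper assumes. */
struct node {
    uint64_t child[NCHILD];
};

/* Copy-on-write one child and then commit the whole change with a
 * single 64-bit store into the unmodified parent, instead of CoWing
 * all the way up to the root. */
void
update_child(struct node *parent, int slot,
             struct node *newcopy, const struct node *oldchild)
{
    memcpy(newcopy, oldchild, sizeof(*newcopy));   /* CoW the child          */
    /* ... modify newcopy here, then flush it to persistent memory ...      */
    parent->child[slot] = (uint64_t)(uintptr_t)newcopy;  /* the atomic commit */
}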
- Could you use short-circuit shadow paging in a more
conventional block-based file system? If so, explain how
you would use it. If not, explain why it's not possible.
Yes, you can use a similar technique in a traditional file system, although it's
a little fuzzy exactly how this would differ from a conventional system. The key
is to figure out when you get to an atomic unit, which would typically be a sector.
On one hand, most file system operations only affect a single block, but a block could
be larger than a sector I suppose. Also, multi-block writes definitely span atomic
units, so you'd need to propagate up until you're either in the same indirect block
or at the inode.
- The paper introduces the problem caused by the fact that
processors are allowed to reorder when data gets flushed to main
memory. They introduce epoch barriers to address this.
Disks also reorder requests. Explain how the situations with
disks and with memory are the same or different; if they are the same,
explain what mechanism we use with disk drives to ensure proper
ordering.
The problem is similar -- that's why non-journaling, non-soft update systems end up
having to perform synchronous I/O -- it's the only way to ensure that one I/O completes
before another. Soft-updates is another way that we try to mitigate this, but at the
bottom, it still boils down to synchronous writes.
- Explain how BPFS is well-suited to PCM as opposed to other, more
traditional block-based systems that we've studied.
For one, BPFS uses a much smaller block size, since there is no mechanical
apparatus whose performance needs to be amortized.
Second, BPFS uses atomic 64-bit writes in several places; this would not be
feasible in a block-based system.
- It is likely that some form of byte-addressable persistent memory
will be available on systems within the next few years.
Propose a different way you might use this technology (i.e., something
other than a file system).
Making a persistent (non-volatile) heap seems like one possibly good idea.
Making processes persistent for instant restart seems like another.
Maintaining a persistent cache that you can continue to use after a failure
seems like another.