Computer Science 161 (2014)
Final Exam
24 hours between May 6, 2014 and May 10, 2014 PM
240 minutes/240 points
Instructions
The points allocated to each problem correspond to how much time
we think the problem should take. Although you have 24 hours in
which to take the exam, we do not recommend that you pull an
all-nighter and spend 24 hours completing the exam.
All questions should be directed to me (Margo) via email. I will
check mail regularly, but I will not be watching constantly, so do not
expect an instant response. I strongly encourage you to read through
the exam and fire off any questions that immediately come to mind.
When in doubt, document your assumptions.
Here is my advice on how to complete this exam:
- Read through the entire exam. Send me any questions.
- Go to lunch or dinner or take a walk and think about the exam.
- Spend a couple of hours on the first day/evening writing up answers to the
questions that you know. Sketch out your thoughts about some of the longer
design problems.
- Get a good night's sleep.
- Spend another couple of hours on day 2 working on the remaining
problems and fleshing out the details of your designs.
- Read over your answers.
- Email them to me at margo@eecs.harvard.edu.
- On Monday, May 12, show up to the 2nd floor Lobby of Maxwell Dworkin
to consume waffles and ice cream, pick up your T-shirt,
and optionally iron on a team patch (if you have fancier animal artwork
than my scanned images, please send them to me before the 12th and I'll
have them available for you to iron on).
Please send your answers to me in a PLAIN TEXT file as an attachment
(in a perfect world, your text file would not have lines longer than
80 characters, but I'll reformat it if it does).
Label the answers clearly so it is
obvious what answers correspond to what questions. Leave space
between each question. Please place your answers in order in the
file. Send your completed exam to margo@eecs.harvard.edu.
I will return the graded exam to you; I tend to engage in a dialog as
much as possible, so there is some educational value in reading more
than just the final number.
This exam is open textbook, open source code, and open notes. You
may not use computers (cell phones, PDAs, or any other device that can
talk to a network)
during the exam (other than to record your answers, to view OS/161
or System/161 code, or to examine your notes or the course web site).
Please do not access the Piazza group during the exam.
You must complete the exam on your own.
Exam Policy Statement
Please do not forget to do this!
As stated above, please write the following statement and
then place your name after it, asserting that you have followed the
rules of the exam: "I have neither given nor received help on this exam
and have followed the rules as stated."
If this statement is absent, you will hear from me.
1. Retake Concept Inventory (5 points)
This is a free 5 points -- all you need to do is complete the
concept inventory that you took at the beginning of the year to the best
of your ability.
You will be graded for completion, not based upon the correctness of your answers,
but if you do worse than you did in February, we'll be sad.
2. OS Abstractions (45 points)
The main learning objective of this course is for you to be able to explain how
an operating system provides the abstractions with which programmers are
familiar.
To that end, for each of the three items listed below, explain how the operating
system provides the abstraction to a smart student who has completed CS50, is
reasonably comfortable programming to the POSIX API, but has not taken CS61
(or CS161).
(More concretely, your audience knows how to program, has heard of VM but has
no idea what it means, knows that computers consist of a processor, memory,
and a disk, and knows how to write programs in C that use POSIX system calls
like fork, exec, open, close, read, write.)
Please limit your answer to roughly 100-200 words (I won't count, but I want more
than 3 sentences, and less than a page).
- Processes
A process is the abstraction used to make your program run on a computer.
You can think of a process as consisting of a big box, which we call the
address space. The box contains all the memory (data) that your program
can touch; it also includes the program itself. In addition to your box,
you get some active little thing that moves around inside the box executing
instructions -- think of it as a really well trained insect -- it knows how
to read an instruction and then do what the instruction says, but it can't
get out of the box. When the program needs to do something that the insect isn't
allowed to do (e.g., read from a disk), it makes a really loud noise (a system
call) that tells you (the operating system) that there is something to do.
You look in the box and are given information describing what the insect
wants you to do and where to get any additional information and then you do
it for the little insect.
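To tie the analogy back to code the student already knows, here is a tiny POSIX example (plain C, nothing course-specific); each commented call is one of those "loud noises" that traps into the kernel:
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int
main(void)
{
    /* fork() traps into the kernel, which builds a second box (address
     * space) and a second insect (thread of execution) to run in it. */
    pid_t pid = fork();
    if (pid < 0) {
        perror("fork");
        exit(1);
    }
    if (pid == 0) {
        /* Child: exec asks the kernel to refill this box with a
         * different program. */
        execl("/bin/echo", "echo", "hello from the child", (char *)NULL);
        perror("execl");
        exit(1);
    }
    /* Parent: another trap; the kernel tells us when the child's box
     * has been torn down. */
    waitpid(pid, NULL, 0);
    return 0;
}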
- Virtual Memory
Let's continue with our box analogy. Remember in Harry Potter how they could
have tents that looked normal size on the outside but were ginormous on the
inside? This is kind of how your box works. On the inside it looks like you
have a huge amount of memory at your disposal (usually about 2 or 4 GB;
sometimes a lot more). However, your machine might not have that much memory.
VM is what makes that illusion work. In hardware there is some mechanism or
mechanisms (TLBs and page tables) and on every instruction they translate an
address in your huge box to a real address in the memory. The OS does a quick
sleight of hand -- if the box needs data and the real memory is full, it picks
something to take out of real memory (and leaves a note for itself) and then
finds the item needed and puts it in the real memory.
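To make the translation step concrete, here is a toy sketch. The 4 KB page size and the tiny flat page table are made up for illustration; real hardware uses a TLB and multi-level page tables, but the arithmetic is the same:
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 4096u

/* Toy flat page table: virtual page number -> physical page number. */
static uint32_t page_table[8] = { 3, 7, 1, 0, 5, 2, 6, 4 };

static uint32_t
translate(uint32_t vaddr)
{
    uint32_t vpn    = vaddr / PAGE_SIZE;  /* which page of the big box    */
    uint32_t offset = vaddr % PAGE_SIZE;  /* where within that page       */
    uint32_t ppn    = page_table[vpn];    /* where that page really lives */
    return ppn * PAGE_SIZE + offset;
}

int
main(void)
{
    /* Virtual address 0x1234 is in virtual page 1; page 1 maps to
     * physical page 7, so the real address is 0x7234. */
    printf("0x1234 -> 0x%x\n", translate(0x1234));
    return 0;
}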
- Files
Computers store data on disks in blocks, which are typically chunks of about 4 KB.
The operating system builds an index structure for your file that provides the
place on disk where each part of your file lives. For example, if your file is
12 KB, then this index structure will say something like, "block 0 is at address A,
block 1 is at address B, and block 2 is at address C." Directories work kind
of the same way -- the directory is stored as a file, but its contents are structured
in a specific manner. You can think of a directory as containing an array of
structures, where each structure contains a number that uniquely identifies a file
and the name (relative to the directory) corresponding to that file.
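For a concrete (and deliberately simplified) picture of that index structure and that "array of structures", something like the following; the names, types, and sizes are made up for illustration, but they mirror the information a real on-disk format carries:
#include <stdint.h>

#define MAX_NAME_LEN 255        /* assumed limit, for illustration */

/* One slot in a directory: a number that uniquely identifies a file
 * (an inode number) plus that file's name within this directory. */
struct dir_entry {
    uint32_t inode_number;
    char     name[MAX_NAME_LEN + 1];
};

/* The index structure for the 12 KB file example: with 4 KB blocks it
 * conceptually records three disk addresses -- block 0 at A, block 1
 * at B, block 2 at C. */
struct file_index {
    uint32_t block_addr[3];
};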
3. Pot Pourri (35 points)
- (15 points) We know that operating systems enforce security on files by means of
access control lists and capabilities, but operating systems must be
security conscious about other things as well.
Think about the three A's, and list 3 things an operating system does to
provide security. For each item, if possible, describe which of the A's it
addresses.
Here are five for fun.
- Virtual memory is a security mechanism; it prevents one process from
reading another's address space. A virtual address is like a capability
for a memory location, so that's an authorization mechanism. The access
enforcement is both in the HW and in the VM system that manages page
tables and TLBs. The authentication mechanism kicks in when a process
is created and given the credentials associated with a particular user.
- Login: Forcing users to log in lets you associate an identity with
each session/process, etc. The login mechanism itself is an authentication
action.
- Argument checking of system calls: we copy user process data into the
kernel and do some amount of error checking (a small sketch of this idea
appears after this list). This is really a kind of
access enforcement in that we are trying to prevent user processes from
handing data to the OS that would cause it to crash or, worse yet, leak
information.
- Process scheduling: this is a kind of DoS avoidance mechanism. It
prevents a process from taking over the whole system -- I guess that
makes it a kind of access enforcement, but I don't think it fits neatly
in any category.
- Disallowing user processes direct access to I/O devices. This is a great
example of access enforcement: allowing direct access would bypass access
control, letting users do things for which they don't have the authority.
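Here is a rough sketch of the argument-checking idea from the third item above. The boundary constant and function names are made up for illustration; OS/161's real copyin/copyout do this more carefully (and survive page faults on the user address):
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define USERSPACE_TOP 0x80000000u   /* assumed user/kernel boundary */

/* Reject user pointers that wrap around or reach into kernel space. */
static bool
user_range_ok(const void *uptr, size_t len)
{
    uintptr_t start = (uintptr_t)uptr;
    return start + len >= start && start + len <= USERSPACE_TOP;
}

/* Copy a user buffer into the kernel before acting on it, so a bad
 * pointer can't make the kernel read or write memory it shouldn't. */
static int
copy_in_sketch(void *kbuf, const void *ubuf, size_t len)
{
    if (!user_range_ok(ubuf, len)) {
        return -1;              /* EFAULT in a real kernel */
    }
    memcpy(kbuf, ubuf, len);
    return 0;
}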
- (20 points) Discuss how virtualization and time-sharing are similar/different.
Again, one paragraph each for similarities and differences should be sufficient.
Similarities:
- Both are about equitably sharing resources among different agents, where
agent = user in a time shared system and agent=VM in a virtualized
environment.
- Both want to provide as much isolation as possible, performance isolation
as well as protection
- Time sharing systems can deschedule user processes in the same way that
a VMM can deschedule a virtual machine.
Differences:
- Virtualization adds a layer of indirection that time-sharing didn't have
and thus makes some things more difficult (e.g., the need for
a balloon driver instead of directly managing memory).
- The operating system typically has some notion of priorities
or fairness that it uses to, for example, give a benefit to I/O intensive
jobs. In the VM environment, we tend to simply design for performance isolation.
4. File System Design: Inode location management (15 points)
LFS keeps track of the locations of inodes by storing a large array
in the ifile. The inode number is used as an index into
this array, which contains the disk address of the corresponding inode.
In contrast, ZFS stores all dnodes directly in an object set (objset). Each
dnode lives in a precise location in the objset (e.g., dnode N lives
at offset sizeof(dnode) * N). So updating a dnode is just like updating
any other part of an object or objset.
Both systems use a no-overwrite storage mechanism, so inodes and
dnodes move around the disk. Explain in one to two paragraphs
how these two approaches
to tracking the locations of file metadata are similar and dissimilar.
Think about how many I/Os are required to update files in each system.
These two approaches actually sound very similar to me. In LFS, we use a file, while
ZFS uses an objset. An objset is simply a more general abstraction than a file. However,
updates are handled similarly. In LFS, when you update an inode, you write a new
copy of that inode and simply change its disk address in the ifile, which gets written
to disk to be looked up later. In ZFS, when you update a dnode, that changes the address
in a pointer in the dnode of the objset. If we think about the number of I/Os, in LFS you get:
- 1 I/O for the updated inode
- 1 I/O for the corresponding block in the ifile
- 1 I/O for the ifile inode
- 1 I/O for the superblock that tells you where the ifile inode lives
In ZFS you get:
- 1 I/O for the updated dnode
- 1 I/O for the dnode of the objset containing a file system's dnode
- 1 I/O for the pointer to the dnode of the objset.
So, ZFS uses one fewer I/O. BUT ... because an update to a dnode is "in place", you
will write out many dnodes that didn't change (i.e., more than 1 dnode fits in a block).
In LFS, you repack inodes together, so that if you modify inodes 5, 500, and 5000, you can
put them all in the same block. In ZFS, dnodes 5, 6, 7, etc. will always be in the
same block, while dnodes 500 and 5000 will be in different blocks. So, ZFS may end up
writing more total blocks.
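To make the packing arithmetic concrete, here is a tiny sketch (the block and dnode sizes are assumptions for illustration, not real ZFS values):
#include <stdio.h>

#define BLOCKSIZE  4096
#define DNODE_SIZE 512

int
main(void)
{
    int dnodes[] = { 5, 500, 5000 };
    for (int i = 0; i < 3; i++) {
        int n = dnodes[i];
        /* In the fixed-layout scheme, dnode n lives at offset
         * n * DNODE_SIZE, so which block it sits in is fixed by n. */
        printf("dnode %d -> objset block %d\n", n, n * DNODE_SIZE / BLOCKSIZE);
    }
    /* Prints blocks 0, 62, and 625 -- three different blocks -- whereas
     * LFS could repack those three dirty inodes into a single block. */
    return 0;
}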
At the end of the day, these approaches seem quite similar to me.
5. File System Design: Journaling user data
This is one of those long design questions. Read it through first,
but then move on to some of the short answer questions, before
diving into this one.
Today's modern file systems, like ZFS, support many features, such as:
- Deduplication: finding blocks containing identical data,
storing only one copy of such blocks, and then letting multiple
files reference them.
- Snapshots: read-only versions of a file system as of a point
in time.
- Clones: These begin as a snapshot, but allow subsequent updates.
These features result in data blocks being referenced from multiple
places (e.g., two different files; two different snapshots).
Your job is to design a scheme for efficiently journaling user data in such a
system. You should take advantage of the fact that blocks can be referenced from
multiple locations.
You should assume the following about your system (many of these ideas are taken
from ZFS):
- Meta-data updates are journaled. That means that you have a journaling and
recovery mechanism, so you only need to talk about adding new records and possibly
changing how recovery might work.
- Your system takes periodic checkpoints (on the order of every 5-10 seconds),
and you are guaranteed that the disk is 100% consistent after every checkpoint.
- It is OK to delay all file system operations, except those that explicitly
require synchronous writes (e.g., sync, fsync). However, it is not OK to lose
all updates between checkpoints, so your log does need to be written more
frequently than every checkpoint; it just need not be flushed on every transaction.
Part I (50 points)
In this part, you will explain your mechanism for journaling user data.
The key design constraint that you must satisfy is that file data may
only be written to persistent storage once.
You should include a description of any log record(s) you wish to use.
You should explain when these log records are created and when they must
be written to persistent storage.
You should also explain if the addition of these log records introduces
any dependencies in your buffer management.
Finally, discuss the advantages and disadvantages of your approach.
We suggest that you begin approaching this problem by figuring out everything
that you need to do to handle a write system call.
Then consider how you will handle the following two scenarios (where BLOCKSIZE is
the size of a block in the file system):
Scenario 1
char buf[BLOCKSIZE];
write(fd, buf, BLOCKSIZE);
Scenario 2
char buf[BLOCKSIZE];
write(fd, buf, BLOCKSIZE/2);
write(fd, buf+BLOCKSIZE/2, BLOCKSIZE/2);
Next start figuring out log records you will need and any
dependencies in your buffer cache.
Write that all up and then step back and assess it critically
to discuss the advantages and disadvantages.
The key insight is that we can write the data once and then reference it both
from the log and from a file. This introduces the following tricky parts:
- we will need to write the data to disk before we write the log record.
This is TOTALLY backwards from practically everything we do. The only reason
it is OK is because new data is always written to new blocks, so the only
bad outcome from this is that we write data, no log record is ever written
(due to a crash), and those blocks continue to be treated
as free blocks. This wastes disk bandwidth, but does not
damage the file system's integrity.
- there are multiple steps involved in writing and logging data
- free space management just got a bit more complicated.
When an application writes data to a file, we need to do the following:
- (i) Find a free block into which to write the new data.
- (ii) Log the block allocation.
- (iii) Write the data.
- (iv) Write a log record referencing the newly written data block.
- (v) Log the change in the dnode.
- (vi) Flush the log at commit.
As discussed above, we need to make sure that we flush the data (step iii) before we
let record iv go to disk (otherwise we have a log entry referencing data that is not
yet written). Fortunately, we don't really have to flush the log on every commit; we
can batch commits together and push them to disk once we've accumulated a lot of
data. In the meantime, we might be able to write blocks that were written in their
entirety to disk.
So, our new log record looks something like:
txnid = TXNID
type = data write
file = identify the file (dnode)
block = block number
daddr = block pointer referencing the newly written block
chksum = a checksum: not strictly necessary, but if only one copy of
the block is successfully written, perhaps we can use this to
figure out which of the multiple copies is correct and recover
partially written data? That seems like a pretty advanced
feature, so I'm not going to discuss it further, but it's a neat
way to avoid having to write all N copies of a block.
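In C terms, that record might look roughly like this (everything here -- names, types, the constant -- is illustrative, not a real on-disk log format):
#include <stdint.h>

#define LOG_DATA_WRITE 7   /* made-up record-type constant */

/* Sketch of the "data write" log record described above. */
struct data_write_record {
    uint64_t txnid;       /* transaction that performed the write      */
    uint16_t type;        /* LOG_DATA_WRITE                            */
    uint64_t dnode_id;    /* identifies the file (its dnode)           */
    uint64_t blockno;     /* which block of the file was written       */
    uint64_t daddr;       /* block pointer to the newly written block  */
    uint32_t chksum;      /* optional checksum of the block contents   */
};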
When we create that log record, we also make it dependent upon the data block -- we must
not flush the log until that data block is safely on disk.
A huge disadvantage of this scheme is that instead of all log writes being
sequential, we have these data blocks that are strewn hither and yon around the
storage system. In order to minimize the impact, I will asynchronously write blocks
when we "fill" them. I'll call a block "filled" when we write the last byte in it.
So, if you write sequentially (as all good programs do), then when you finish
writing a block, we'll write it asynchronously. If you write a partial block, we won't
immediately write it, but will keep it cached until either A) you finish writing it
or B) we have to flush it so we can write a log record.
OK, so now our write algorithm looks like the following (a code sketch follows the list):
- Find a block to allocate
- Log the allocation
- Log the dnode (or indirect block) update to reference the newly allocated block
- Update the buffer in memory
- If we "filled" the block, schedule it for write.
- Log the write record described above
- Commit
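Here is the same write path as a sketch in code. Every helper it calls is hypothetical, declared only so the sketch reads as C; the point is the ordering and the dependency, not the interfaces:
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCKSIZE 4096

struct file;
struct logrec;
struct buf { uint64_t daddr; char data[BLOCKSIZE]; };

/* Hypothetical helpers, assumed for this sketch only. */
uint64_t       balloc(void);
struct buf    *getblk(struct file *f, uint64_t blockno);
struct logrec *alloc_record(uint64_t daddr);
struct logrec *dnode_update_record(struct file *f, uint64_t blockno,
                                   uint64_t daddr);
struct logrec *data_write_record(struct file *f, uint64_t blockno,
                                 uint64_t daddr);
void           log_append(struct logrec *r);
void           log_append_with_dependency(struct logrec *r, struct buf *bp);
void           schedule_async_write(struct buf *bp);
void           log_commit(void);

void
file_write(struct file *f, uint64_t blockno,
           const void *data, size_t len, size_t off)
{
    uint64_t daddr = balloc();                           /* allocate a block  */
    log_append(alloc_record(daddr));                     /* log allocation    */
    log_append(dnode_update_record(f, blockno, daddr));  /* log dnode update  */

    struct buf *bp = getblk(f, blockno);                 /* update the buffer */
    memcpy(bp->data + off, data, len);
    bp->daddr = daddr;

    if (off + len == BLOCKSIZE)                          /* block "filled"?   */
        schedule_async_write(bp);

    /* The log must not be flushed past this record until bp is on disk,
     * so the dependency is attached to the record. */
    log_append_with_dependency(data_write_record(f, blockno, daddr), bp);

    log_commit();                                        /* commit, no flush  */
}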
And our log flush algorithm looks like this:
- For any write records in the current batch of committing transactions,
verify that the buffer has been written to disk; if it has not, write it now.
- Flush the log
We can test the first condition by simply retaining an in-memory list of blocks on
which chunks of the log depend (by a chunk of log, I mean a decent-sized unit of log
that I write).
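And the flush side, under the same hypothetical helpers as the write sketch above, where deps is that in-memory list of blocks this chunk of the log depends on:
/* Sketch of flushing one chunk of the log; helpers are assumed. */
struct buf;
int  buf_is_on_disk(struct buf *bp);
void write_buf_now(struct buf *bp);   /* synchronous write of the data block */
void write_log_chunk(void);           /* the actual log write                */

void
log_flush_chunk(struct buf **deps, int ndeps)
{
    /* Any block that a committing write record points at must reach
     * disk before the record does; write the stragglers now. */
    for (int i = 0; i < ndeps; i++) {
        if (!buf_is_on_disk(deps[i]))
            write_buf_now(deps[i]);
    }
    write_log_chunk();
}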
Advantages
- We now have user-data journaling (yeay!)
- We don't write the data twice (yeay!)
Disadvantages
- We can clutter up our I/O system with non-sequential writes as part of
flushing the log.
- If we have random access writes to a block and frequently write the last
byte of the block, we might write the block multiple times (although we
might be able to detect this).
- We might end up slowing log writes because of the data writes.
Part II (10 points)
Now consider any modifications that you might need to make to the
system's free space management.
Our free space management already has to handle the case that a block
can be referenced from multiple places, so that's a big win. The only
things we really need to figure out are:
- (1) When does the log no longer need to retain its reference to a data block?
- (2) Where do we store that information?
Item 1 is the big deal: the log is only required to keep track of things
between checkpoints, so we know that the log can relinquish its reference
to a block as soon as the next checkpoint occurs. That's great news! It
means that all we need to ensure is that we don't fully reclaim any disk
block until the checkpoint following its allocation -- that doesn't sound
tough, does it?
Let's think about deallocation, because that's where this comes up.
So let's assume that we're doing a truncate (or a delete). There
are two cases to consider:
The block was allocated in a previous checkpoint: no problem, we don't need
the log record any more, so we can ignore its reference.
The block was allocated in the current checkpoint: the deallocation in and
of itself isn't a huge problem, but if we were to reallocate it again, then
we could create the following situation:
txn 1: Allocate and write block B
...
txn 2: Free block B
txn 3: Allocate and write block B
flush the block to disk
crash...
Let's assume that B did get written to disk in txn 3, and furthermore, let's assume
that the log record for txn 2 didn't make it to disk. When we try to recover
the log, our original write had better still be valid. So, this suggests that we
cannot reallocate blocks allocated in the current checkpoint.
That doesn't sound so hard -- we already need to keep track of newly allocated blocks
until the log flushes (as mentioned above), so we can keep tracking them until the
end of the checkpoint and simply delay deallocations of those blocks until the
next checkpoint (a small sketch of this bookkeeping follows).
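A small sketch of that bookkeeping, with a made-up set-of-blocks interface just to show the shape of it:
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical set-of-blocks interface, assumed for illustration. */
struct blockset;
bool blockset_contains(struct blockset *s, uint64_t daddr);
void blockset_add(struct blockset *s, uint64_t daddr);
void blockset_apply(struct blockset *s, void (*fn)(uint64_t));
void blockset_clear(struct blockset *s);
void really_free_block(uint64_t daddr);

struct blockset *allocated_this_ckpt;   /* blocks allocated since the last checkpoint */
struct blockset *deferred_frees;        /* frees we are holding back                  */

void
free_block(uint64_t daddr)
{
    if (blockset_contains(allocated_this_ckpt, daddr)) {
        /* Allocated in the current checkpoint: don't let it be
         * reallocated until the checkpoint commits. */
        blockset_add(deferred_frees, daddr);
    } else {
        really_free_block(daddr);
    }
}

void
checkpoint_done(void)
{
    /* The disk is consistent again, so the held-back frees are safe. */
    blockset_apply(deferred_frees, really_free_block);
    blockset_clear(deferred_frees);
    blockset_clear(allocated_this_ckpt);
}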
6. Designing for Byte-Addressable Persistent Memory (80 points)
For this question, you will need to read
this paper
from the beginning up until section 4.1 (You will want to read about PCM, but you can
stop after that.)
The paper presents a way to build a file system on
byte-addressable persistent memory.
(You are, of course, free to read the entire
paper, but I wanted to keep the portion necessary for the exam
within reason.)
You will undoubtedly find it useful to read the questions
before you start reading.
- In your own words, describe short-circuit shadow paging.
Short-circuit shadow paging is a way of stopping copy-on-write behavior
before it propagates all the way up to the root of the file system.
Basically, you CoW until the point where you can effect the change with a
single atomic write.
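A rough sketch of that "single atomic write", assuming a hypothetical tree node made of 8-byte child pointers in persistent memory (the paper's real commit also involves epoch barriers to order the flushes; this only shows the shape of the idea):
#include <stdint.h>
#include <string.h>

#define NCHILD 512

/* Hypothetical tree node living in byte-addressable persistent memory.
 * An aligned 8-byte store is atomic on the hardware the paper assumes. */
struct node {
    uint64_t child[NCHILD];
};

/* Copy-on-write one child and then commit the whole change with a
 * single 64-bit store into the unmodified parent, instead of CoWing
 * all the way up to the root. */
void
update_child(struct node *parent, int slot,
             struct node *newcopy, const struct node *oldchild)
{
    memcpy(newcopy, oldchild, sizeof(*newcopy));   /* CoW the child          */
    /* ... modify newcopy here, then flush it to persistent memory ...      */
    parent->child[slot] = (uint64_t)(uintptr_t)newcopy;  /* the atomic commit */
}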
- Could you use short-circuit shadow paging in a more
conventional block-based file system? If so, explain how
you would use it. If not, explain why it's not possible.
Yes, you can use a similar technique in a traditional file system, although it's
a little fuzzy exactly how this would differ from a conventional system. The key
is to figure out when you get to an atomic unit, which would typically be a sector.
On one hand, most file system operations only affect a single block, but a block could
be larger than a sector I suppose. Also, multi-block writes definitely span atomic
units, so you'd need to propagate up until you're either in the same indirect block
or at the inode.
- The paper introduces the problem caused by the fact that
processors are allowed to reorder when data gets flushed to main
memory. They introduce epoch barriers to address this.
Disks also reorder requests. Explain how the situations with
disks and with memory are the same or different; if they are the same,
explain what mechanism we use with disk drives to ensure proper
ordering.
The problem is similar -- that's why non-journaling, non-soft update systems end up
having to perform synchronous I/O -- it's the only way to ensure that one I/O completes
before another. Soft-updates is another way that we try to mitigate this, but at the
bottom, it still boils down to synchronous writes.
- Explain how BPFS is well-suited to PCM as opposed to other, more
traditional block-based systems that we've studied.
For one, BPFS uses a much smaller block size, since there is no mechanical
apparatus whose performance needs to be amortized.
Second, BPFS uses atomic 64-bit writes in several places; this would not be
feasible in a block-based system.
- It is likely that some form of byte-addressable persistent memory
will be available on systems within the next few years.
Propose a different way you might use this technology (i.e., something
other than a file system).
Making a persistent (non-volatile) heap seems like one possibly good idea.
Making processes persistent for instant restart seems like another.
Maintaining a persistent cache that you can continue to use after a failure
seems like another.