Computer Science 161 (2013)
Final Exam
May 7, 2013 - May 8, 2013 5:00 PM
240 minutes/240 points
Instructions
The points allocated to each problem correspond to how much time
we think the problem should take. Although you have 24 hours in
which to take the exam, we do not recommend that you pull an
all-nighter and spend 24 hours completing the exam. Here is Margo's
advice on how to complete this exam:
- Download the exam and read through the entire exam.
- Go to dinner and think about the exam.
- Spend a couple of hours Tuesday evening writing up answers to the questions
that you know.
- Get a good night's sleep.
- Spend another couple of hours on Wednesday working on the remaining
problems.
- Read over your answers.
- Email them to me at margo@eecs.harvard.edu.
- Show up in Pierce 309 to consume ice cream, pick up your T-shirt,
optionally iron on a team patch (see Piazza post for details).
Please send your answers to me in a PLAIN TEXT file as an attachment
(in a perfect world,
your text file would not have lines longer than 80 characters, but
I'll reformat it if it does). Label the answers clearly so it is
obvious what answers correspond to what questions. Leave space
between each question. Please place your answers in order in the
file. Send your completed exam to margo@eecs.harvard.edu.
I will return the graded exam to you; I tend to engage in a dialog as
much as possible, so there is some educational value in reading more
than just the final number.
This exam is open textbook, open source code, and open notes. You
may not use computers (cell phones, PDAs, or any other device that can
talk to a network)
during the exam (other than to record your answers, to view OS/161
or System/161 code, or to examine your notes or the course web site).
Please do not access the Piazza group during the exam.
You must complete the exam on your own.
Exam Policy Statement
At the end of your exam, assuming that you have followed the rules
as stated above, please write the following statement and
then place your name after it, asserting that you have followed the
rules of the exam. "I have neither given nor received help on this exam
and have followed the rules as stated."
If this statement is absent, you will hear from me.
1. Buffer Cache Fun (60 points)
These questions are based on real world challenges in file systems.
A. O_DIRECT
As you learned in assignment 4 and our file system lectures, file
systems often maintain an in-memory collection of pages to speed access
to frequently used files (i.e., the buffer cache).
Software that wants more explicit control over when data gets to disk
(e.g., transactional database systems) frequently finds the buffer cache
a problem. Such programs often implement their own in-memory cache, and when
they issue a write system call, they really want the data to go to disk.
In recognition of this problem, some systems provide an open flag
(i.e., a flag to the open system call), O_DIRECT (or the F_NOCACHE
flag to the fcntl system call), that directs the file system not to
cache data for the file in question, but instead to service read/write
requests directly from/to the disk.
Consider the following scenario:
- Process P opens file F with the O_DIRECT flag set.
- The standard cp utility does not set O_DIRECT.
- Now, let's say that P periodically writes blocks to F.
- A shell script wakes up periodically and copies F to make a backup.
If P follows the semantics of O_DIRECT exactly and cp uses the
buffer cache as it currently exists, you can get out-of-date backups.
For example, let's say that at time 1, cp copies file F and leaves
blocks in the buffer cache.
Now P writes a block of F directly to disk (i.e., does not update the
buffer cache).
If cp runs again while its blocks are still in the cache, it could
copy the old version of F without P's most recent update.
Design an implementation of O_DIRECT that solves this problem in the
context of OS/161.
You needn't write pseudocode, but identify the key data structures
you want to change and, in English, describe what parts
of the code you wish to modify and how you wish to modify them.
After you present your design, explain how it avoids the problem
described.
Note: One challenge not explicitly mentioned is that you might have
dirty buffers in the cache when P opens a file O_DIRECT. You may ignore
this situation in this question.
There were many ways to address this problem. In terms of code path,
I was hoping people would mention some aspect of passing the O_DIRECT flag
around and storing it. From there, you could have done various things:
If you make things go through the buffer cache:
* store O_DIRECT in the open-file structure at open time
* pass O_DIRECT through to VOP_READ/VOP_WRITE
* pass it from sfs_read/sfs_write to and through sfs_io
* in sfs_blockio and sfs_partialio, if writing do buffer_writeout
at the end, and then do buffer_release_and_invalidate instead of
buffer_release.
If you make all accesses to O_DIRECT files bypass the cache:
* tag the vnode with O_DIRECT when it's opened that way
(this probably requires a counter, not just a flag, in case there
are multiple such opens)
* in sfs_blockio, if the vnode is tagged O_DIRECT, go straight to
the disk device with the passed-in uio structure.
* in sfs_partialio, even if the vnode is tagged O_DIRECT, you still
need to use a buffer, so do the same thing as in the previous
way.
If you try to invalidate cached blocks when you O_DIRECT write them:
* store O_DIRECT in the open-file structure at open time
* pass O_DIRECT through to VOP_READ/VOP_WRITE
* pass it from sfs_read/sfs_write to and through sfs_io
* in sfs_blockio, if doing O_DIRECT, go straight to the disk device
with the passed-in uio structure.
* on write, also do buffer_drop on the block.
* on read... one way is to ignore the buffer cache (in which case,
non-O_DIRECT writes not yet flushed out may not be seen); another
way is to do buffer_drop, and if someone has done a non-O_DIRECT
write they lose; a third way is to do buffer_read to read the
block, and buffer_release_and_invalidate afterwards (like above),
which guarantees write consistency at the expense of doing an
extra copy for all reads.
* in sfs_partialio, if doing O_DIRECT, just return EINVAL (that is,
demand block-aligned I/O) or do the same as in the other ways.
The first way solves the stale read problem by making sure all buffers
are either dropped or up to date. The second way solves the stale read
problem by not allowing stale data to be cached. The third way solves
it by explicitly invalidating cached data when it goes out of date.
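As a concrete illustration of the first approach, here is a minimal sketch of
the block I/O path (sfs_partialio would get the same treatment). The buffer_*
calls follow the ASST4 buffer-cache interface named above, but the exact
signatures and struct fields vary by OS/161 version, and the odirect flag
threaded down from the open-file structure via sfs_io is an assumption of
this sketch, not existing code:

/*
 * Sketch only: write through the buffer cache, then force the block out
 * and invalidate it so later readers cannot see a stale copy.
 */
static
int
sfs_blockio(struct sfs_vnode *sv, struct uio *uio, bool odirect)
{
        struct buf *iobuf;
        daddr_t diskblock;
        int result;

        /* ...translate uio->uio_offset to diskblock as the existing code does... */

        result = buffer_read(sv->sv_absvn.vn_fs, diskblock,
                             SFS_BLOCKSIZE, &iobuf);
        if (result) {
                return result;
        }

        /* Copy one block between the cache buffer and the caller's iovec. */
        result = uiomove(buffer_map(iobuf), SFS_BLOCKSIZE, uio);
        if (result) {
                buffer_release(iobuf);
                return result;
        }

        if (uio->uio_rw == UIO_WRITE) {
                buffer_mark_dirty(iobuf);
                if (odirect) {
                        result = buffer_writeout(iobuf);       /* to disk now */
                        buffer_release_and_invalidate(iobuf);  /* no stale copy */
                        return result;
                }
        }

        buffer_release(iobuf);
        return 0;
}

The second and third approaches differ mainly in whether this function skips
buffer_read entirely and hands the passed-in uio straight to the disk device.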
B. Mmap
One of the system calls we did not ask you to implement is mmap.
Mmap takes a file (or part thereof) and maps it into the address
space of a process.
Thus, it provides another way (in addition to open/close/read/write)
to access file data from a program.
- Sketch out in a paragraph or two how you might implement a limited
form of mmap that simply maps an entire file into a process's virtual
address space at whatever address the VM system wants. For this part
of the question, ignore what happens if anyone tries to extend the
file or if another process accesses the file using
open/close/read/write (that comes in the next part).
At one level, this is only a small extension to the VM system: we have
entries in a process's page table that reference pages on disk.
However, at another level, it greatly complicates things, because right now
when a page is backed on disk, that page lives in swap.
Using mmap, we now need to distinguish between pages that live in
swap and pages that live in a particular file.
Let's pretend for now that I implemented a segmented page table.
I would place mmapped files in their own segment, so that I could
associate segments with the right backing stores.
Some of the segments would have a "0-fill" backing store; some would
have swap; and others would reference files.
For this part of the question I'm going to leave the file table in place
and I would have an mmap essentially do an open (for read/write).
I would then give the segment a reference to that open file, indicating
that the file backs this part of virtual memory.
There is some additional error checking I need to do -- I need to find
a place in the address space large enough to map the entire file.
For example, if I have a 32-bit address space and a TB file, I have to fail.
Even if the file isn't larger than my address space, if it's larger than
any contiguous region in my address space I have to fail.
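To make the bookkeeping concrete, here is a minimal sketch of the per-segment
state this design implies; all of the type and field names below are
hypothetical, not part of stock OS/161:

/*
 * Hypothetical per-segment descriptor (names invented for this sketch).
 * Each mapped region records where its pages come from, so the fault
 * handler knows whether to zero-fill, fetch from swap, or read from the
 * mmapped file's open-file handle.
 */
enum backing_store {
        BS_ZEROFILL,    /* bss/heap/stack: first touch gets a zeroed page */
        BS_SWAP,        /* anonymous memory that has been paged out */
        BS_FILE,        /* mmapped file */
};

struct segment {
        vaddr_t seg_base;               /* starting virtual address */
        size_t seg_npages;              /* length of the mapping in pages */
        enum backing_store seg_backing;
        struct openfile *seg_file;      /* only when seg_backing == BS_FILE */
        off_t seg_fileoff;              /* file offset backing seg_base */
};

mmap itself then amounts to opening the file, finding a free stretch of the
address space at least as large as the file (failing if there is none), and
recording a BS_FILE segment that points at the open file.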
- Discuss in another paragraph or two the implications that mmap has
on the file system buffer cache and how you might address them.
This part is potentially much trickier.
The problem arises if some other process opens the file and starts reading
and writing to it; if I don't do anything special, there will be parts of
the file in the buffer cache as well as in pages mapped into a process's
address space managed by the VM system.
That sounds pretty ugly.
So it seems that what I want to do is have pages in the buffer cache
that the VM manages.
Most systems today implement an integrated VM and buffer cache, so that
would seem to be where I'm headed. For now I'll try to keep it simple.
I guess I'd let the file system call upon the VM system to give it pages
that the OS maps into its address space to create the buffer cache.
I would not want to require that all open files be mapped in their
entirety in the kernel's address space, because that would limit the
sizes and/or number of files I could have open.
I'd still use something like the file open table to keep track of what
files were being managed.
So, let's say an mmap comes in and it's a first open: I simply mmap
as above and don't do anything special.
Now let's say that another process opens the file normally (i.e., the
open system call).
I would see that the file was already in a process address space when
I looked it up in the file table.
So, I would essentially let the buffer cache now include the region mapped
by the process so that subsequent accesses to that file would result
in my sharing the exact same page in memory with the process.
The slow and simple way to do this would be to create buffer headers
for each mmapped page, but that seems like overkill, so I would want
to teach the buffer manager to consult the process's page table and
create an efficient mechanism to do it (i.e., a pointer into the appropriate
page table).
Now let's consider the opposite order: I have buffers in the buffer
cache and a process tries to mmap it.
That seems fairly straightforward in that when I create the mappings
in the process's address space, I just need to make sure that I map
the pages that are already in memory to the same physical pages that
the VM system is managing. If I refcount pages properly, this should
all work nicely.
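As a rough sketch of that refcounting, a per-frame coremap entry might look
like the following; the names are invented for this sketch and do not exist
in stock OS/161:

/*
 * Hypothetical coremap entry (names invented for this sketch).  Both a
 * buffer header and any number of process page-table entries may point
 * at the same physical frame; the frame is only freed or evicted when
 * cm_refcount drops to zero, and cm_buf lets the VM system find the
 * buffer header (and vice versa) when the page is shared file data.
 */
struct coremap_entry {
        paddr_t cm_paddr;       /* physical address of this frame */
        unsigned cm_refcount;   /* page-table mappings + buffer headers */
        bool cm_dirty;          /* must be written back before reuse */
        struct buf *cm_buf;     /* buffer header if cached file data; else NULL */
};

With that in place, mapping an already-cached block into a process just bumps
the count, and freeing or evicting the frame waits until the count reaches zero.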
2. Virtual Memory (40 points)
The LEG processor MMU has the following characteristics:
- It implements a 32-bit virtual address space.
- It has a 64 entry TLB.
- Each entry can contain a global mapping (address space IDs are
ignored) or it can contain a 6-bit address space ID.
- Each entry can map either 1 MB or 4 KB.
- Each entry specifies access of one of: no-access, read-only, read-write.
- It implements 2-level page tables in hardware.
- The page tables are only accessed on a TLB fault.
- Each entry in the first/top level page table either translates a full
1 MB region or points to a second level table.
- The second level page table translates 4 KB pages.
- Access to the page tables is as follows (you might find it useful
to draw a picture for this process).
- The top 12 bits of the virtual address index into the top level
page table.
- If the entry is a 1 MB region, then the necessary part of the
physical address resides in the top level table.
- If the entry is not for a 1 MB region, then the physical address
of the beginning of the second level page table resides in the top level
page table.
First level page table entries are stored as shown. The "A M" bits
represent 2 bits for access modes, with the following meanings.
- 00 = Invalid mapping
- 01 = Mapping is valid, but no access is granted
- 10 = read-only
- 11 = read-write
[First level page table entry diagram not reproduced; the recoverable
details are:]
- 1 MB mapping: 20 bits of physical address (plus the A M bits).
- Mapping for a second level page table: 22 bits of physical address
(plus the A M bits).
Second level page table entries look like: [diagram not reproduced]
- Why are 12 bits of the Virtual Address used to index into
the top level page table?
(4) That leaves 20 bits or 1 MB, so each entry in the top level table maps 1 MB,
which lets you store the translation for large regions in the top level table.
- How many entries are there in the top-level page table?
(2) 4096
- On a TLB fault handled by the hardware page table path,
how many memory accesses are required to read data stored in a 1 MB
mapped segment?
(2) 2 -- 1 for the 1st level table; 1 for the data
- On a TLB fault handled by the hardware page table path,
how many memory accesses are required to access data stored in a 4 KB page
(not in a 1 MB segment)?
(2) 3 -- 1 for the 1st level table; 1 for the second
level table; 1 for the data
- Which bit(s) of a first level page table entry tell you whether the
address you are translating is part of a 1 MB region or a 4 KB page?
(2) The low 2 bits -- actually either of the bits
- How many entries are there in a 2nd level page table?
(4) 1 MB / 4 KB = 256
- As described this processor can only access a 32-bit physical address
space. How could you let it access a larger physical address space?
(4) Use some of the unused bits as part of the physical
address.
- Is it possible to run the LEG processor on a system with
less than 4 GB (32 bits worth) of physical memory?
(2) Yes
- Can two different pages in the same 1 MB region mapped in the
top-level page table have different access protection?
(2) No
- Can two consecutive 4 KB pages mapped through second level page tables
have different access protection? (2) Yes
For the next 4 problems, assume that part of the top level page table
contains the following entries (the first entry shown corresponds to the
1025th one in the table, i.e., index 1024).
Note that 22-bit entries are represented
using the values 0x000000 through 0x3FFFFF.
---- Beginning of Top Level Table ----
---- Skipping Entries 0 - 1023 ----
| 0x50000  | 0xFF | 0x3 | 10 |
| 0x00000  | 0xFF | 0x0 | 10 |
| 0x123456 | 0x3F | 0x2 | 01 |
| 0xDEAD0  | 0xFF | 0x2 | 10 |
| 0x00BEEF | 0x3F | 0x3 | 01 |
| 0x2B012B | 0x3F | 0x3 | 01 |
| 0xABCDE  | 0xFF | 0x1 | 10 |
---- Rest of Top Level Table ----
- At what address will you find the page table entry for virtual
address 0x40277123?
(6) 0x48D159DC = (0x123456 << 10) + (0x77 << 2) =
0100 1000 1101 0001 0101 1001 1101 1100
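As a sanity check on that arithmetic, this tiny standalone program (ordinary
C, not LEG or OS/161 code) reproduces the value:

#include <stdio.h>

int
main(void)
{
        unsigned va = 0x40277123u;
        unsigned l1_entry = 0x123456u;         /* 22-bit table address from the entry at index 0x402 = 1026 */
        unsigned l2_index = (va >> 12) & 0xFF; /* bits 19..12 pick 1 of 256 entries: 0x77 */
        unsigned pte_addr = (l1_entry << 10) + (l2_index << 2); /* 4-byte entries */

        printf("0x%08X\n", pte_addr);          /* prints 0x48D159DC */
        return 0;
}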
- Given the information above,
do you know that 0x40277123 is an address from which you are allowed to read?
(2) No -- page permissions could deny it.
- What is the physical address for virtual address 0x40666ABC?
(4) 0xABCDE66ABC (Note that this exceeds 32 bits, which
was really quite unfair)
- Given the information above, do you know that you can read from that
address? (2) Yes -- there are no additional permission
bits to check
3. Multikernel/Barrelfish (80 points)
For this question, you will need to read the first eight and a half pages of
this paper on the Multikernel architecture
and its implementation as Barrelfish.
(You are, of course, free to read the entire
paper, but I wanted to keep the portion necessary for the exam
within reason.) Please answer the following questions.
- The abstract states, "An evaluation of our prototype on multicore
systems shows that, even on present-day machines, the performance
of a multikernel is comparable with a conventional OS, and can scale
better to support future hardware." Do you believe that it is sufficient
to show that performance is comparable? Why/Why not? Limit your discussion
to 1-2 paragraphs.
(20 points)
I do believe that comparable performance is sufficient - if by comparable,
they mean single-threaded performance. If they include scalability or the
ability to exploit multiple cores, then comparable is not sufficient; they
need to do much better, since that's their focus.
The greatest challenge we face on today's hardware is being able to
exploit multiple cores, so if single-threaded performance is nearly
as good as it is on regular operating systems, that's OK.
They really need to show that they are better able to exploit multiple
cores. This means that if they run many threads/processes they derive
more benefit from additional processors in ways that other operating
systems do not.
Fundamentally, they are claiming that shared memory is a bottleneck, so
they want to show that by designing a system that doesn't use shared
memory as much, they avoid memory bottlenecks and obtain better aggregate
throughput from the sum total of the processors.
- Figure 3 shows that there is a very small overhead in updating
a data item that spans
eight cache lines among many (> 8) cores using message passing,
but a significant overhead in sharing eight cache lines among
an equivalent number of cores using shared memory.
What explanation do they provide for this difference?
(15 points)
In a shared memory arrangement, each write to a cache line causes
stalls for the cache misses as the data moves among the various
cores. As the number of cores increases, the number of stalls
increases. In message passing, there is a relatively constant
overhead in transmitting the message to a server process, but
then there are no cache misses. So, in shared memory we are limited
by the speed of the memory hierarchy while in the message passing
case, we are limited by how long it takes the server to process
the message.
- In Figure 4, let's assign a numerical scale to the X axis such
that "Shared state, one big lock" = 1, "Finer-grained locking" = 2,
"Clustered objects, paritioning" = 3, and "Distributed state, replica
maintenance" = 4. What number would you give OS/161? Why?
(10 points) 2: OS/161 is designed for multiple processors, so it tries
to do something smarter than "one big lock," but it doesn't do anything
fancy such as clustered objects.
- Although OS/161 is a shared memory and not a message-passing
architecture, there are places where a core behaves in a split-phase
manner (section 3.1). Give one example.
(15 points) In the I/O system, we frequently issue an I/O, deschedule
the currently running process and then go off and do something else.
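As an illustration of that shape, a disk read might look roughly like this;
lock_acquire, cv_wait, and cv_signal are real OS/161 primitives, but the
device structure and helper names here are invented for the sketch:

/*
 * Split-phase sketch (field and helper names invented).  Phase 1: start
 * the transfer and sleep; the core is free to run other threads.
 * Phase 2: the device's completion interrupt handler sets d_iodone and
 * calls cv_signal, and this thread resumes.
 */
static
int
mydisk_read_block(struct mydisk *d, daddr_t block, void *buf)
{
        lock_acquire(d->d_lock);
        d->d_iodone = false;
        mydisk_start_transfer(d, block, buf);   /* phase 1: kick off the I/O */
        while (!d->d_iodone) {
                cv_wait(d->d_done, d->d_lock);  /* sleep until the interrupt */
        }
        lock_release(d->d_lock);                /* phase 2: continue with the data */
        return d->d_ioresult;
}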
- Had you completed assignments 2, 3, and 4 in Barrelfish, for each
major piece of functionality (system calls, scheduling, virtual memory,
journaling), in which Barrelfish component(s) would the majority of the
functionality reside? If you specify more than one component, explain
how the work would be partitioned.
(20 points)
- System calls: CPU driver to receive the trap, but would then
dispatch to a user level server in many cases.
- Scheduling: CPU driver
- Virtual memory: CPU driver for actual hardware MMU manipulation;
monitor for keeping track of process page tables.
- Journaling: User-level service
4. Distributed File Systems (18 points)
You have been hired to design a new distributed file system, TNVFCSDFS (the
new very fast, consistent, simple distributed file system).
It is up to you to decide if you want your distributed file system protocol
to be stateful or stateless. You've decided to poll your customers to figure
out what they value most: performance, consistency, or simplicity (which
is likely to get your product to market more quickly and more reliably).
Your design will depend on what you learn from your customers.
- If your customers say that performance is the most important
criterion, what will you decide? Why?
(10 points) This is the interesting one ... I gave credit for either
answer if you made valid points (e.g., with state you can know that
your cached copy is valid and save messages to/from the server on
open and you can cache more aggressively; without state, the master
has less work to do and probably scales better).
- If your customers say that consistency is the most important
criterion, what will you decide? Why?
(4 points) If you want consistency, then stateful is the way to go;
building a consistent distributed file system without state is very
tricky; if you had a rock solid approach, I gave credit, but I don't
think anyone did.
- If your customers say that simplicity is the most important
criterion, what will you decide? Why?
(4 points) Recovery is infinitely easier with a stateless system.
If you could convince me that something was simpler with a stateful
system, I gave credit, but doing that was tricky.
5. Pot Pourri (42 points)
- The answer to the "coming to class on time" question is what?
(Do not stress; this is worth only 2 points.)
I want a pony
- You are the Chief Technology Officer of a startup.
Your company sells a cloud-based service that stores personal account
information including credit card data.
You have been given a budget of one million dollars for this year
to make sure that the data you store is secure.
How will you spend that million dollars? Justify your choice (if you
are going to hire someone, explain what you want that person to do).
(20 points)
People are my greatest vulnerability and I don't have the expertise to
do this, so I'm going to spend about one quarter of this budget to hire
someone who does this for a living. I want that person to:
- Establish user-account management practices: passwords, validation,
password reset requests, account auditing, etc.
- Establish in-house best practices for granting employees access
to various pieces of hardware and software (e.g., DBA privileges).
- Specify fully how we will manage passwords internally.
- Develop a best-coding practices course, with significant testing,
that we will make all our engineers take.
I'll also make about another quarter of a million available to
that person to implement his/her plan.
I would spend another quarter of a million dollars hiring a team
to try to break my system.
I would research the firm aggressively.
I would run a shadow system that they could try to break (i.e., one without
real data, but perhaps one with real passwords).
I'd use the last quarter of a million on insurance.
- One of the goals of virtualization is performance isolation.
Performance isolation means that anything running on a virtual machine
can't tell, from its performance, whether it's running on a virtual
machine or not.
You've been hired as the chief architect for VMsRUs. Your senior
development engineer, Chris, is confident that the best way to
achieve performance isolation is to simply partition all the resources on
the physical machine among all the VMs running on it.
Write an email to Chris explaining why you think this is a bad idea.
Your email should be somewhere around 2-4 paragraphs.
(20 points)
Although I did not take points off for it, I did look for a graceful
and appropriate introduction to this email. For example, the following,
while technically sound, is not a suggested approach.
Dear Chris, you are a moron. First, just because we have N virtual
machines allocated does not mean that A) all N are active, B) all N
are equally important, or C) all N require the same amount of all
resources. So, statically partitioning resources among four different
VMs with different needs is just a bad idea.
Second, what happens when I decide to add the N+1st VM? Do
you take resources away from the other VMs? What do you do about disks?
Do you just throw people's files away?
Third, what happens when a process tries to use more than its share of
resources, even though that amount is way less than the HW has to offer? That
would kind of alert something running on the VM that there is something
fishy going on.
Perhaps the following would be more reasonable (I took this from a student
answer since it was way more fun for me to construct the one above.)
Dear Chris,
I've been thinking about that discussion we had about providing performance
isolation to the virtual machines we're running and I'm beginning to think
that simply partitioning the resources among the machines may not be the best
way to go about this. In particular, I have the following concerns ...