CS417 Exam 3

Fall 2012

Paul Krzyzanowski

Part I – 12 points

6 points
You have a huge amount of data on stock transactions. Each record contains the following information:
```
						{ timestamp, ticker_symbol, sale_price, shares }
```
You want to find the average share price of each traded company for the year 2005 and will do this using MapReduce. Explain the operations in your map and reduce functions. Assume that a default partition function is used.
For the map function, assume that the function is called once for each line of input and contains the parameters mentioned above. Explain any processing logic that takes place in the map worker to pre-process, discard, duplicate, or combine data. Also, be sure to state the sequence of data that each map worker emits, with the first item being the key. You may use the pseudocode emit(a, b, c, ...) to represent the sequence of data that a map worker emits.
For the reduce function, assume that each worker is invoked with the parameters <key, data_list>. You may use “for each X in data_list” pseudocode to iterate over each of the items in data_list. You may also use the pseudocode emit(a, b, c, ...) to represent the sequence of data that a reduce worker emits. Explain any processing logic that takes place in the reduce worker.

Map(timestamp, ticker_symbol, sale_price, share_count):

Reduce(key, data_list):
6 points
How can Bob use Alice’s X.509 digital certificate to authenticate Alice? Explain the sequence of steps needed. Ignore the need to validate the certificate.

Part II – 87 points – 3 points each

For each statement, select the most appropriate answer. You may omit one question. Please indicate your choice clearly.

Akamai addresses the flash crowd problem by
(a) Transparently replicating data on servers managed by Akamai.
(b) Providing a clustering solution for a company’s data center.
(c) Multihoming: connecting a customer’s machines to multiple networks to guard against ISP failure.
(d) Connecting a company’s machines to Akamai’s network instead of the Internet.
Unlike conventional DNS servers, Akamai’s dynamic DNS
(a) Will likely provide a different IP address to the same query depending on where the request comes from.
(b) Will track changing IP addresses for a server and provide the latest address in response to a query.
(c) Can handle multiple requests concurrently.
(d) Will return different IP addresses to the same query to load balance requests among a customer’s machines.
Akamai’s load shedding is the process of
(a) Not returning the IP addresses of heavily loaded machines in response to a DNS query.
(b) Migrating jobs from heavily loaded machines to more lightly loaded machines.
(c) Terminating long-running processes on machines when they become heavily loaded.
(d) Setting a lower threshold for the rate of queries that a DNS server will process.
The partitioning function in MapReduce
(a) Determines which shard will be assigned to a specific map worker.
(b) Filters out unnecessary input data prior to being processed by the map worker.
(c) Determines the division of available servers into map workers and reduce workers.
(d) Determines which reduce worker will process data associated with a particular key.
In MapReduce, the first reduce worker starts processing data
(a) At the same time as map workers start in order to maximize concurrency.
(b) As soon as at least one map worker begins to emit output.
(c) When least one map worker completes.
(d) When the last map worker completes.
A tablet in Bigtable is a subset of a table containing
(a) A range of consecutive rows and a range of consecutive column families.
(b) All rows but a subset of column families
(c) Frequently used rows and column families cached in memory for performance.
(d) A range of consecutive rows and all column families.
Differing from a distributed hash table, Bigtable
(a) Can support replication for fault tolerance.
(b) Supports more than one primary key for looking up data.
(c) Can associate large amounts of data with a key.
(d) Sorts its data by the key and allows iteration over rows of sorted data.
Back propagation in flooding-based distributed lookup is used when
(a) The system needs to constrain the maximum number of hops used in processing a query.
(b) A query, through a series of hops, loops back at the originating node.
(c) A node needs to find the most efficient route to the original requestor.
(d) A node needs to reply but does not know the address of the original requestor.
Which statement is not true about CAN, the Content-Addressable Network?
(a) CAN makes use of an overlay network.
(b) Adding a new node to the group will usually not require the majority of keys to be remapped.
(c) A query key will be hashed several times.
(d) For a system with n nodes, a query will typically require O(n) hops.
Amazon Dynamo allows an administrator to add a greater load to a bigger server in the group by
(a) Configuring the client library to issue a higher percentage of its requests to that server.
(b) Configuring other servers to forward some percentage of their requests to the bigger server.
(c) Using consistent hashing and giving a larger consecutive range of the hash space to the server.
(d) Assigning more virtual nodes to that server.
Application-based reconciliation in Dynamo allows an application to
(a) Resolve conflicts at write time instead of at read time.
(b) Send updates to a server even if fewer than half of the replica servers are available.
(c) Ensure that data is propagated to all nodes holding replicas of a key, value set.
(d) Make a decision on what to do about conflicts resulting from past concurrent writes.
Amazon Dynamo does not achieve high performance by
(a) Having the ability to add more nodes dynamically.
(b) Concurrently dispatching one query to multiple nodes.
(c) Ensuring each node knows the entire set of servers.
(d) Allowing the optional use of an in-memory database.
Suppose that you could test all combinations of a 56-bit key in one hour. Testing a 64-bit key will take
(a) 3 hours.
(b) 8 hours.
(c) 256 hours (10.7 days).
(d) 65,536 hours (7.5 years)
For Alice to send a message so that only Bob can read it, she would
(a) Encrypt the message with Alice’s private key.
(b) Encrypt the message with Alice’s public key.
(c) Encrypt the message with Bob’s private key.
(d) Encrypt the message with Bob’s public key.
The Diffie-Hellman algorithm allows Alice and Bob to
(a) Compute a value that can be used as a key for symmetric cryptography.
(b) Share a session key using public key cryptography.
(c) Compute a value that can be used as a key for public key cryptography.
(d) Validate each other’s identity but not communicate securely.
Kerberos avoids key explosion by
(a) Using a trusted third party.
(b) Using the Diffie-Hellman algorithm.
(c) Using common keys.
(d) Using public key cryptography.
For Alice to send a signature of a message to Bob, she hashes the message and encrypts it with:
(a) Alice’s private key.
(b) Alice’s public key.
(c) Bob’s private key.
(d) Bob’s public key.
A hybrid cryptosystem uses
(a) Public key cryptography for key exchange and symmetric cryptography for communication.
(b) Two layers of encryption for added security: symmetric cryptography and public key cryptography.
(c) A different cryptosystem for each direction of communication.
(d) Symmetric cryptography for key exchange and public key cryptography for communication.
The Challenge Handshake Authentication Protocol relies on
(a) Hash functions.
(b) Public key cryptography.
(c) Symmetric cryptography.
(d) A combination of all of the above.
OpenID and OAuth both
(a) Use HTTP redirection.
(b) Allow a user to maintain a single ID across multiple services.
(c) Allow the user to authorize limited functionality for a service.
(d) Use a third-party to manage user IDs.
Compared with a regular packet filter, a packet filter with deep packet inspection capabilities (DPI) will enable you to
(a) Block specific URLs targeted for your web server on TCP port 80.
(b) Drop all IP packets originating from 128.6.*.*.
(c) Accept all packets going to your mail server on TCP port 25 except those from 128.6.4.2.
(d) Block external packets that are masquerading as internal packets (i.e., their source address is that of your internal network)
A DMZ contains machines that
(a) Offer services to the external network.
(b) Are potentially not secured and hence vulnerable.
(c) Do not require connectivity to either the external or internal networks
(d) Host services that are accessed only from the internal network.
Tunneling in a VPN (Virtual Private Network) refers to
(a) A way of detecting malicious packets being injected into the network.
(b) A way of securing communication between two network endpoints.
(c) The encapsulation of full IP packets within other packets.
(d) The compression of packets that increases network performance between two routers.
Packets on a VPN (Virtual Private Network) are generally secured and made tamper-proof via
(a) Encryption.
(b) Digital signatures.
(c) A combination of encryption and digital signatures.
(d) Tunneling.
A stuck bit in memory location is an example of a
(a) Fail-silent fault.
(b) Byzantine fault.
(c) Intermittent fault.
(d) Transient fault.
802.11n (Wi-Fi) uses Low-density parity-check (LDPC) to improve reliability over noisy networks. This is an example of
(a) Physical redundancy.
(b) Information redundancy.
(c) Time redundancy.
(d) Reliable multicast.
A necessary component in a cluster file system is a
(a) Distributed lock manager (DLM).
(b) Heartbeat network.
(c) Storage area network (SAN).
(d) System area network (SAN).
An application monitoring process detects a failed web server and starts a web server on backup machine. This is called
(a) Cold failover
(b) Warm failover
(c) Hot failover
(d) Cascading failover.
Which cluster is most likely to use MPI (message passing interface)?
(a) High performance computing (HPC).
(b) Batch processing in a computer graphics render farm.
(c) Load-balanced web servers.
(d) High availability cluster.
A key design principle of the Google cluster architecture is
(a) Merging is fast; break up databases into lots of pieces and use lots of processes, each working on a piece of the data.
(b) Minimize the number of phases in a task; create many replicas of a database and allow each system full access to it without the need for sharing or locking.
(c) Minimize context switch overhead; machines are cheap and it's more efficient to devote a node to one task exclusively.
(d) Machines and software fail; run the same task redundantly in parallel to ensure successful execution.