Direct-to-NFS

Abstract: eioMAIL.com used NetApp Filers to store the service’s account and e-mail data. But instead of using standard operating system calls to access that data, the eioMAIL.com servers included a custom NFS library that allowed the software to micro-manage its usage of the file store.

1   Introduction

1.1   eioMAIL.com

I invented [1] Target Revocable E-Mail in January 2000. Each user in a Target Revocable E-Mail system is given their own domain name. Every address at that domain name is routed to the user’s account and is given its own mailbox or folder. Users can revoke a mailbox to prevent further deliveries to that address (in case it starts receiving spam, for example). This system has now come to be known as a disposable e-mail address.

The Target Revocable E-Mail Corporation [2] was founded in April 2000 to deploy and market this invention. The business plan for the company included the following goals:

  1. Help people stop spam by providing e-mail accounts that included Target Revocable E-Mail as a core feature.
  2. Build a user-supported e-mail service instead of relying on banner ads and pop-ups to generate revenue.
  3. Design a user interface that looked good, worked well on every web browser (even Lynx!), and was easy to use.

These goals were realized by building a platform for deploying e-mail services to businesses and ISPs. The first customer of that platform would be eioMAIL.com, a service managed by the Target Revocable E-Mail Corporation. The company itself was also a customer of the service and used the Target Revocable E-Mail system for its internal corporate e-mail.

#1:

I use the term “invented” even though it is unclear if I was the first person to come up with the concept of Target Revocable E-Mail (TRE). As far as I can tell, my system was the first to offer end-to-end TRE technology with a complete user interface.

#2:

Software patents were all the rage in 2000 when I launched the Target Revocable E-Mail Corporation. While Target Revocable E-Mail is great technology, it wasn’t the kind of thing that I felt should be patented. I still wanted some protection in case someone else decided to patent the idea, so I named the company after the technology. I hoped that the incorporation papers would serve as proof of “prior art” if I ever needed such a thing. Thankfully, I never did.

1.2   Design goals

Supporting the business plan for the Target Revocable E-Mail Corporation — and by extension, eioMAIL.com — were a number of design goals:

  1. The service had to be cheap to build and maintain. I was funding the Target Revocable E-Mail Corporation out of my own pocket, so the operational cost of the service was definitely a factor in its design.
  2. I hoped to have thousands, hundreds of thousands, even millions of users one day, so the service needed a built-in way of scaling to handle additional load.
  3. The service had to be stable and secure. It would be unacceptable to lose an e-mail that had been accepted for delivery.

I had extensive experience building network software and web applications and planned to apply that experience in the design of the Target Revocable E-Mail system. An online service is larger and more distributed than a single web application or e-mail server, but many of the concepts remain the same.

There were many decisions that contributed to the final design of the service. One of the more central decisions was the choice of a storage platform.

2   Storage Choices

The Target Revocable E-Mail system would use one of two storage methods: a distributed storage network built from hard drives in each of the servers, or a network attached storage (NAS) device. There are advantages and disadvantages to each approach.

2.1   Local Disks

Locally-attached disks are cheap and relatively fast. If you need to store hundreds of gigabytes of data, local disks are often the way to go. There are two questions that need to be answered when going this route: where do the disks go and where does the data go?

You could put the disks into one of the machines and then make that machine a file server. You pretty much have a network attached storage device at that point, but a UNIX machine with a bunch of disks is going to be cheaper than a dedicated NAS device. On the other hand, the dedicated NAS device will probably have a number of features that you cannot (easily) get on your UNIX file server: clustering with transparent fail over, checkpoints, replication, etc.

Another option is to put disks in all of your servers and then write custom storage software that spreads your data across the network. Accessing data in the distributed network is a two step process: find the server with the data, then retrieve the data from the server. Many services do this, Google included [3], but I did not feel that it was the right solution for the Target Revocable E-Mail system.

My main concern with this approach was making sure that the data was securely stored. In a system like mine, where an e-mail message must never be lost, you need to make sure that you are storing the data in more than one place on your network. If a message is stored on only one server, and if that server were to experience a hardware failure, the message could be permanently destroyed.

To avoid this problem, you must store the data on at least two servers. If one of those servers fails, you must immediately go through your data store and make sure that the failed server was not acting as a backup data store for another server. If it was, you need to copy that data onto yet another server.

Not all online services need this level of redundancy, but our customers were paying customers and I felt that data security was an important part of our offering.

#3:

Google developed a distributed file system called the Google File System (GFS) to meet their unique storage needs. GFS is a great example of building a custom solution based upon a deep understanding of your service’s specific requirements.

2.2   Network Attached Storage

After considering the pros and cons of local disks, I decided to use a network attached storage device. I settled on the NetApp Filer, which implements the Network File System (NFS) protocol.

NetApp Filers have a number of features that I was attracted to: snapshots, clustering with live fail-over, plenty of redundancy, and the ability to increase the storage capacity without taking down the system. I was not going to use all of these features immediately, but it was nice to know that they were there.

I went so far as to put all of the data for the Target Revocable E-Mail system on the Filer, not just the e-mail messages. None of the machines had any hard drives in them; they loaded their operating system and server software from the Filer.

The decision to centralize all of the data on the Filer made it much easier to manage the storage infrastructure. It also meant that all of the servers were identical and could be easily swapped out and replaced if one of them failed.

3   NFS Internals

One of the key aspects of the NFS protocol is that NFS servers are not expected to track the state of their clients [4]. All of the information necessary to satisfy a request by the client is sent in each message to the server.

In order to simplify these messages, every file and directory on the server is identified by a unique filehandle. The filehandle is an array of bytes that only the server understands. The client is given the filehandle by the server and will use that filehandle to manipulate the data on the server.

In most (all?) NFS server implementations, filehandles are permanently attached to a file or directory and remain the same for the lifetime of the file. In other words, a client can look up a filehandle and then work with that file for days, weeks, even months or years without the filehandle changing.

This behavior is certainly not guaranteed by the protocol, but my experience with the NetApp Filer indicates that the Filer does offer permanent filehandles.
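
To make the filehandle concept concrete, here is a minimal sketch of how a client might hold one. The class is illustrative only (it is not the eioMAIL.com code); the point is that the handle is nothing more than opaque bytes that can be cached, compared, and echoed back to the server.

    import java.util.Arrays;

    // Illustrative sketch: to the client, an NFS filehandle is just an opaque
    // byte array that is received from the server and sent back on every
    // subsequent request. The client never interprets its contents.
    public final class FileHandle {
        private final byte[] bytes;

        public FileHandle(byte[] bytes) {
            this.bytes = bytes.clone();      // defensive copy; contents are opaque
        }

        public byte[] bytes() {
            return bytes.clone();
        }

        @Override public boolean equals(Object o) {
            return o instanceof FileHandle && Arrays.equals(bytes, ((FileHandle) o).bytes);
        }

        @Override public int hashCode() {
            return Arrays.hashCode(bytes);   // lets filehandles serve as cache keys
        }
    }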

#4:

NFS Version 4 adds a stateful mode to the protocol.

4   Building the Solution

4.1   RPC and NFS

NFS is based on the Remote Procedure Call (RPC) protocol. RPC provides a common method of communicating between distributed systems on a network. The RPC protocol usually runs over UDP, although it can also operate on TCP links.

RPC clients are available for almost every major operating system. At the time that I was developing the Target Revocable E-Mail service, these libraries were not available for Java. There are ways to access C code from Java, but a pure-Java implementation of the RPC and NFS protocols would make my system much easier to work with.

Since I could not find an existing library, I wrote my own RPC and NFS implementations. Although my software needed to interoperate at the network level, there was no requirement that the APIs look the same as the C-based APIs.

I wanted to leverage the stateless nature of NFS, so I purposefully avoided using a standard file system API. My code would call directly into the NFS server’s READ and WRITE functions. Doing so enabled the software to optimize its usage of the NFS server on a case-by-case basis.
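
The sketch below shows the shape of such a call, reusing the FileHandle class from the earlier sketch. The NfsClient interface and its method signatures are assumptions made for illustration; the actual eioMAIL.com API is not reproduced here.

    import java.io.IOException;

    // Hypothetical client interface: one method per NFS procedure of interest.
    // A read is a single RPC carrying the filehandle, offset, and count; the
    // reply carries the bytes. There is no open() or close() step.
    interface NfsClient {
        NfsReadResult read(FileHandle fh, long offset, int count) throws IOException;
        void write(FileHandle fh, long offset, byte[] data) throws IOException;
    }

    interface NfsReadResult {
        byte[] data();    // bytes returned by the server
        boolean eof();    // true if the read reached end of file
    }

    final class MessageReader {
        private final NfsClient nfs;

        MessageReader(NfsClient nfs) { this.nfs = nfs; }

        // Fetch a byte range of a message with a single READ. The cached
        // filehandle is all the state the server needs to satisfy the request.
        byte[] readRange(FileHandle fh, long offset, int count) throws IOException {
            return nfs.read(fh, offset, count).data();
        }
    }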

My one concession to a standard Java API was a custom implementation of the InputStream class. This class, when instantiated with a filehandle and byte range, would look just like any other Java InputStream. I could then pass that object to third-party libraries that needed an InputStream.
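
A minimal sketch of such a wrapper, using the hypothetical NfsClient interface from the sketch above, might look like this; the buffer size and field names are assumptions:

    import java.io.IOException;
    import java.io.InputStream;

    // Sketch of an InputStream backed by a filehandle and byte range. Callers
    // see an ordinary stream; each refill is a single NFS READ against the
    // cached filehandle.
    public class NfsInputStream extends InputStream {
        private static final int CHUNK = 8192;      // bytes fetched per READ (assumed)

        private final NfsClient nfs;
        private final FileHandle fh;
        private final long end;                     // first offset past the range
        private long pos;                           // next offset to read
        private byte[] buf = new byte[0];
        private int bufPos = 0;

        public NfsInputStream(NfsClient nfs, FileHandle fh, long offset, long length) {
            this.nfs = nfs;
            this.fh = fh;
            this.pos = offset;
            this.end = offset + length;
        }

        @Override
        public int read() throws IOException {
            if (bufPos >= buf.length) {
                if (pos >= end) {
                    return -1;                      // end of the requested range
                }
                int want = (int) Math.min(CHUNK, end - pos);
                buf = nfs.read(fh, pos, want).data();
                if (buf.length == 0) {
                    return -1;                      // server returned no data (EOF)
                }
                pos += buf.length;
                bufPos = 0;
            }
            return buf[bufPos++] & 0xFF;
        }
    }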

I had direct access to the NFS protocol, but I still built an abstraction layer on top of my NFS libraries. This layer allowed the various Target Revocable E-Mail servers to work with “messages” and “accounts” instead of filehandles and RPC requests.

The abstraction layer identified each message by a three-tuple of message id, generation number, and filehandle. The filehandle could be encoded in a URL and then, when returned to the HTTP server, used for instantaneous access to a message.

The filehandle was not required to access the message — the numeric portion of the id was sufficient — but if available it allowed the abstraction layer to provide rapid access to individual messages.
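
The sketch below illustrates the idea of a message reference that carries an optional filehandle and can round-trip through a URL. The field layout, separator, and Base64 encoding are assumptions; the article does not describe the exact format eioMAIL.com used.

    import java.util.Base64;

    // Illustrative message reference: id, generation number, and an optional
    // cached filehandle that can be embedded in a URL for fast retrieval.
    public final class MessageRef {
        final long messageId;
        final long generation;
        final byte[] filehandle;     // may be null when only the id is known

        MessageRef(long messageId, long generation, byte[] filehandle) {
            this.messageId = messageId;
            this.generation = generation;
            this.filehandle = filehandle;
        }

        // Encode the tuple into a URL-safe token, e.g. ".../msg/<token>".
        String toUrlToken() {
            String fh = filehandle == null
                    ? ""
                    : Base64.getUrlEncoder().withoutPadding().encodeToString(filehandle);
            return messageId + "." + generation + "." + fh;
        }

        // Decode the token from a later HTTP request. A missing filehandle
        // simply means the slower id-based lookup path is taken instead.
        static MessageRef fromUrlToken(String token) {
            String[] parts = token.split("\\.", 3);
            byte[] fh = parts[2].isEmpty() ? null : Base64.getUrlDecoder().decode(parts[2]);
            return new MessageRef(Long.parseLong(parts[0]), Long.parseLong(parts[1]), fh);
        }
    }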

4.2   File formats

The two most common actions in an e-mail system are listing a mailbox and reading a message. With the exception of toggling an unread flag, neither of these actions changes the state of the system. To make these two operations as efficient as possible, the software went directly to the mailbox and message files when performing these tasks.

4.2.1   Record layout

My custom NFS client let me access data on the NFS server with a single RPC call, but for that to be useful I needed to know the offset in the file from which to read. As a result, all of the files in the Target Revocable E-Mail system were record-oriented.

Each entry in the mailbox index was stored at a fixed offset that could be computed from the message id. Message files had a fixed-size header; the body was always located at the same place in the file. Mailbox index files were never compacted, ensuring that records would not move while another server was accessing the file.
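
As a simple illustration of fixed-offset addressing (the header and record sizes below are invented for the example):

    // Minimal sketch of fixed-offset addressing in a mailbox index file. A
    // message id maps directly to a byte offset, so one NFS READ can fetch
    // exactly the record that is needed.
    final class MailboxIndexLayout {
        static final long HEADER_SIZE = 512;   // assumed fixed-size file header
        static final int  RECORD_SIZE = 256;   // assumed fixed-size index record

        static long recordOffset(long messageId) {
            return HEADER_SIZE + messageId * RECORD_SIZE;
        }
    }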

The order in which the contents of each file were written was also specified as part of the design. Entries in the mailbox index had a flag indicating whether the message had been delivered and could be displayed to the user. The rest of the fields were always populated before this flag was set. This prevented a server from presenting a partially-filled record to the client.
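
A rough sketch of that ordering, using the hypothetical NfsClient interface from section 4.1, could look like this; the flag position and record layout are assumptions:

    import java.io.IOException;

    // Sketch of the write ordering described above. The record fields are
    // written first; only after that WRITE completes is the "delivered" flag
    // set, so a concurrent reader never sees a half-filled entry.
    final class IndexWriter {
        private static final long DELIVERED_FLAG_OFFSET = 0;   // assumed: flag is the first byte

        private final NfsClient nfs;

        IndexWriter(NfsClient nfs) { this.nfs = nfs; }

        void writeEntry(FileHandle indexFh, long recordOffset, byte[] fields)
                throws IOException {
            // 1. Populate every field of the record except the delivered flag.
            nfs.write(indexFh, recordOffset + 1, fields);
            // 2. Flip the flag last; readers ignore the record until this byte is set.
            nfs.write(indexFh, recordOffset + DELIVERED_FLAG_OFFSET, new byte[] { 1 });
        }
    }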

4.2.2   Ownership

Each type of file was owned by one of the server components. Mailbox index files were owned by the mailbox server. Message files were initially owned by the server that created the message (SMTP for incoming messages, HTTP for outgoing messages). The ownership of the file was then transferred to the mailbox server for final delivery.

I am using the word “ownership” in the logical sense; the actual UNIX permissions on the files never changed. The owner of the file was the server that was allowed to write to the file and, in the case of message files, delete the file.

4.3   Security

The development of an online service requires you to code defensively, especially in those parts of the system that deal with external content. Protection against malicious users is often the focus of the design, but it is important to guard the service against programmer error as well.

A number of security practices were put into place to protect the service; this article only discusses those practices that dealt with the storage infrastructure.

4.3.1   User and group permissions

Each of the servers in the Target Revocable E-Mail system was assigned a specific set of user and group permissions. The various account, index, and mailbox files were assigned owners and access restrictions as well. Only the mailbox server could modify index files; the other servers could read the files, but not change them.

The service was written in Java, so this level of security was not entirely necessary. Buffer overflows and stack manipulation attacks are nearly impossible to launch against a Java application. But I might someday optimize portions of the service in another language and wanted these security precautions to already be in place.

As an aside, the users of the service did not have their own UNIX user and group ids. That would not have scaled beyond a few thousand users. Instead, the various components of the service were assigned individual user and group ids and used these ids when communicating with the file server.

4.3.2   Filehandle validation

My system aggressively cached NFS filehandles and would often access a message using a filehandle that might have been retrieved months (or even years!) earlier. Although my research during the development phase indicated that filehandles were stable and that their target would not change, I did not want to rely on this behavior.

It was important that the service correctly deal with any errors that might result from attempting to access a stale filehandle. The worst case would be if the target of a filehandle changed without the filehandle itself becoming invalid. Though very unlikely, this could allow a user to access a message that is not part of their account.

To defend against this, each file contained the customer’s id as well as the message’s unique id. This information was checked every time a file was accessed. The system was designed from the beginning to include this extra information; as a result, a single NFS READ call could be used to retrieve the desired data as well as the file verifiers.
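
A sketch of that check, again using the hypothetical NfsClient interface, is shown below. The verifier layout (customer id and message id as the first sixteen bytes of the file) is an assumption for illustration:

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.util.Arrays;

    // Sketch of filehandle validation: the same READ that fetches the requested
    // data also covers the verifier fields, so the server round-trip count does
    // not increase. Offsets and field sizes are illustrative.
    final class ValidatingReader {
        private static final int VERIFIER_SIZE = 16;   // customer id + message id, 8 bytes each

        private final NfsClient nfs;

        ValidatingReader(NfsClient nfs) { this.nfs = nfs; }

        byte[] readBody(FileHandle fh, long expectedCustomer, long expectedMessage,
                        long bodyOffset, int bodyLength) throws IOException {
            // One READ covering both the verifier fields and the requested body.
            byte[] data = nfs.read(fh, 0, (int) bodyOffset + bodyLength).data();
            ByteBuffer verifier = ByteBuffer.wrap(data, 0, VERIFIER_SIZE);
            if (verifier.getLong() != expectedCustomer || verifier.getLong() != expectedMessage) {
                throw new IOException("stale or foreign filehandle; fall back to id-based lookup");
            }
            return Arrays.copyOfRange(data, (int) bodyOffset, (int) bodyOffset + bodyLength);
        }
    }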

The worst-case situation described above was never detected by any of the servers. Most filehandles include a generation number to protect against a filehandle being used to access something other than the intended target. That said, I was not privy to the manner in which the NetApp constructed its filehandles, so I had to assume the worst.

5   Results

My idea of going directly to the NFS server instead of using the normal file system routines was something of a gamble. As far as I know, no one had ever deployed a system like this before. It was entirely possible that it would not work at all. It was also possible that the NetApp would simply refuse to accept my cached filehandles or would expire the filehandles too quickly.

But none of those things happened. Instead, the system performed flawlessly. With a single UDP round-trip the servers had instantaneous access to almost any piece of data that they needed. The service was able to maximize the potential of its storage infrastructure and avoid the unnecessary overhead of the standard file system calls.

5.1   Maximum integration

Each of the servers in the Target Revocable E-Mail system had a custom interface to the NetApp. The HTTP server encoded filehandles in URLs for fast access to messages in subsequent requests. The POP3 server maintained a cache of filehandles for the messages that it sent to the POP3 client. The SMTP server was probably the simplest of all: it only needed the NFS WRITE call.

Most of these servers went through an abstraction layer to access the file server, but this abstraction layer was carefully built to provide maximum performance without compromising the benefits of the direct-to-NFS approach.

The NetApp, instead of being an adjunct device on the network, was integrated directly into the core of the service. Finally, the code in the Target Revocable E-Mail servers was the sole initiator of NFS operations. Any performance issues could be addressed in one place without having to wonder what the operating system might be doing behind the scenes.

5.2   Centralized storage

The NetApp was the only device in the network that had any hard drives. All of the servers were diskless and loaded their root file system from the NetApp. There were two DHCP servers that had flash drives in them for loading the kernel. After the kernel was loaded, the DHCP servers used the NetApp for their root file system as well.

There were a number of advantages to this configuration. First and foremost, all of the servers (with the exception of the DHCP cluster) were completely identical. They had the same processor, the same amount of RAM, and the same network configuration. If a server failed, any one of the generic backup servers could instantly take its place. The custom-written DHCP servers automatically managed this switch-over, using the SNMP-controlled power distribution units to reconfigure the network as required.

Although I tested all of the hardware- and network-related failure scenarios, it is worth noting that I never experienced any of these failures during the operation of the service. The servers were so simple (motherboard, CPU, RAM, network card, power supply) that there wasn’t much that could go wrong.

I did have at least one hard drive fail in the NetApp. The Filer immediately took the drive out of service and replaced it with the hot-spare. It also contacted NetApp and ordered a new drive, which arrived the next day via FedEx.

5.3   Interoperability

Although not an initial requirement, the custom NFS client worked with other NFS servers as well. Most of the development was done on FreeBSD machines and the software worked flawlessly when configured to talk to a FreeBSD server.

I chose to use the UDP protocol for communication with the NetApp. This reduced the resource load on both the Filer and on my servers. I also implemented a TCP version of the RPC client. This was useful during the development phase as I could tunnel the NFS-over-RPC-over-TCP connection through an SSH connection.

5.4   Common distributed protocol

The invention of the direct-to-NFS concept affected the design of more than just the storage system. I ended up using RPC for communication between all of the servers in the Target Revocable E-Mail system. Doing so enabled me to design fault-tolerance and resource efficiency into the entire network.

Using RPC for everything also made the code simpler. There was a single Java class that was responsible for managing inbound and outbound RPC traffic for each server. This class was completely asynchronous, which helped to leverage the investment in server hardware that I had made.
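
The article does not show that class, but the idea can be sketched roughly as follows. ONC RPC tags every call with a transaction id (XID) that the reply echoes back, which is what makes fully asynchronous operation possible; the class names and structure below are assumptions, written with modern java.util.concurrent types for brevity.

    import java.util.Map;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicInteger;

    // Rough sketch of a single asynchronous RPC manager: each outbound call is
    // registered under its XID, the matching reply completes the pending
    // future, and no thread ever blocks waiting for the network.
    final class RpcManager {
        private final AtomicInteger nextXid = new AtomicInteger(1);
        private final Map<Integer, CompletableFuture<byte[]>> pending = new ConcurrentHashMap<>();

        // Called by server code: register the call, send the datagram, return a future.
        CompletableFuture<byte[]> call(byte[] requestBody) {
            int xid = nextXid.getAndIncrement();
            CompletableFuture<byte[]> reply = new CompletableFuture<>();
            pending.put(xid, reply);
            send(xid, requestBody);                 // fire the UDP datagram, do not block
            return reply;
        }

        // Called by the receive loop when a datagram arrives: complete the matching future.
        void onReply(int xid, byte[] replyBody) {
            CompletableFuture<byte[]> reply = pending.remove(xid);
            if (reply != null) {
                reply.complete(replyBody);
            }
        }

        private void send(int xid, byte[] requestBody) {
            // Placeholder: marshal the RPC call header (xid, program, version,
            // procedure) in front of requestBody and write it to the UDP socket.
        }
    }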

5.5   Real-time upgrades

Upgrades of the Filer could, and did, take place while users were accessing the system. There would be a brief pause between requests as the Filer rebooted and then everything would continue as normal. The entire Target Revocable E-Mail system worked this way.

The service’s modular architecture and comprehensive test platform ensured that the changes, once coded and tested, could be deployed without incident. This allowed me to upgrade the software on the servers while users were still accessing the system. Active sessions were not affected and incoming e-mail would still be accepted. Even the distributed session server could be upgraded in real-time.

Users would request a feature, the feature would be added, new servers would be deployed and the feature would be available on the very next click the user made on the web site. There was no downtime while the servers were upgraded. The user didn’t even realize that anything had happened. Except that a feature that wasn’t there a few seconds ago was now available to them.

6   Conclusion

eioMAIL.com has used my direct-to-NFS technology for more than five years now. The technology has performed flawlessly and met all of the goals that I set out for it.

Would I use this technology again? It would depend on the service. Gmail has raised the bar by offering more than two gigabytes of storage to their customers. As a result, most Internet e-mail services have to place a real emphasis on storage quantity. Locally-attached storage suddenly becomes a more attractive option.

The feature set of vendor-supplied NFS servers has improved as well. Many UNIX operating systems have file system checkpoints, clustering, and high-availability features. Building your own network attached storage device is a better proposition than it was in early 2000.

Not everyone who reads this article should go out and buy a Filer, then start talking to it directly over NFS. But I do hope that this article has encouraged you to take a more outside-of-the-box approach to evaluating the design of your products and services.