Learning Out Loud

I’ve always sort of figured that this blog was a place to post things I was just learning, rather than a place to publish tutorials about technologies that I have more expertise in.

Because of the nature of our field, and the number of new technologies that are always coming out, I’m far more interested in learning than I am in teaching.  There are already plenty of great teachers out there who are blogging, writing or lecturing.  I don’t really aspire to sell myself as a teacher of technologies—that would take far too much time and energy.

Instead, I see this blog as a forum for my attempts to learn new technologies—e.g. WPF and Silverlight.  I’m always looking for new ways to motivate myself to learn new technologies and having a blog is a good way to force myself to dive in and start learning something new.  When I realize that I haven’t posted anything for a few days, I feel the urge to start pulling together the next post.  Then, because I know I’m going to have to write about it, I find that I force myself to explore whatever the topic is in a much deeper manner than I would if I were just reading a book or attending a class.

This has been working out great so far, though I’m discouraged by how little time I have to study these new technologies; I’d like to post far more frequently than I do.  But at least I’m gradually learning new bits and pieces of technologies like WPF and Silverlight.

Reminding myself of my goals also helps me to just relax and not worry so much about making mistakes.  I’m just capturing on “paper” what I’m learning, as I learn it.  Since I’m only beginning the journey of grokking whatever it is, I don’t need to worry about whether I get it right or not.

Remembering all of this led me to change the tagline of this blog.  Instead of offering up my thoughts on various topics, I now see this as “learning out loud”.  That perfectly describes what I think I’m doing—learning new stuff, stumbling through it, and capturing the current state of my knowledge so that I can come back and refer to it later.

So let the journey continue—there’s still so much to learn!

Using HttpWebRequest for Asynchronous Downloads

I’ve occasionally had a desire to test downloading a file via HTTP or FTP from a server—either to measure performance, or to stress test the server by kicking off a large number of simultaneous downloads.  Here’s a little Win Forms client that allows you to download a single file from a server, using either HTTP or FTP.  It shows download progress and displays the average transfer rate, in KB/sec.  It also demonstrates how to use the HttpWebRequest and FtpWebRequest classes in System.Net to do file downloads.

As an added bonus, this app is a nice example of doing a little work on a background thread and then firing callbacks back to the GUI thread to report progress.  This is done using the BeginXxx/EndXxx pattern, as well as using the Invoke method to ensure that GUI updating is done on the correct thread.  I always forget the exact syntax for this, so it’s nice to have it here to refer to.

The bulk of this code comes directly from the MSDN documentation for the HttpWebRequest.BeginGetResponse method.  I’ve created a little client app around it, adding some GUI elements to show progress.  I’ve also extended it to support downloads using FTP.

I include code snippets in this article, but you can download the entire Visual Studio 2008 solution here.

The End Result

When we’re done, we’ll have a little Win Forms app that lets us enter an HTTP or FTP path to a file and then downloads that file.  During the download, we see the progress, as well as the average transfer rate.

Download Stress Test application

For the moment, the application doesn’t actually write the file locally.  Instead, it just downloads the entire file, throwing away the data that it downloaded.  The intent is to stress the server and measure the transfer speed—not to actually get a copy of the file.

If we were to use HTTP and specify an HTML file to download, we’d basically be doing the same thing that a web browser does—downloading a web page from the server to the client.  In the example above, I download a 1.9 MB PowerPoint file from the PDC conference, just so that we have something a little larger than a web page and can see some progress.

Using FTP Instead of HTTP

My little application does FTP, as well as HTTP.  If you enter an FTP-based URI, rather than an HTTP-based one, we automatically switch to using FTP to download the file.  Before the download can start, however, we need to ask the user for credentials to use to log into the FTP site.

FTP Credentials

Once we’ve gotten the FTP credentials, the download runs in much the same way that the HTTP-based download ran.

Downloading a file from an FTP server

In this case, I’m downloading an ISO image of the first CD of a Fedora distribution.  Note that the FTP response string starts with “213”, which gives file status and represents a successful response from the FTP server.  The second part of the response string is the size of the file, in bytes.  In the case of HTTP, the response was just “OK”.

Where Are We?

So what do we really have here?  A little program that downloads a single file without really writing it anywhere.  At this point, we have something that’s mildly useful for testing a server, since it tells us the transfer rate.  Furthermore, we can launch a bunch of these guys in parallel and download the same file many times in parallel, to stress the server.  (Ideally, the application would let us pick #-of-simultaneous-downloads and just kick them all off, but that’s an enhancement to be added later).

Diving Into the Source Code

More interesting than what this little program does is how you go about using the HttpWebRequest and FtpWebRequest classes to do the actual work.

Here’s a quick look at the C# solution:

Files in Solution

There’s really not much here—the main form (DownloadStressTestForm), the FTP credentials form (GetCredentialsForm) and a little helper class used to pass data around between asynchronous methods.

Most of the code lives in DownloadStressTestForm.cs.  Ideally, we’d split this out into the GUI pieces and the actual plumbing code that does the work of downloading the files.  But this is just a quick-and-dirty project.

Push the Button

Let’s take a look at the code that fires when you click the Get File button.

        private void btnGetFile_Click(object sender, EventArgs e)
        {
            try
            {
                lblDownloadComplete.Visible = false;

                WebRequest req = null;
                WebRequestState reqState = null;
                Uri fileURI = new Uri(txtURI.Text);

                if (fileURI.Scheme == Uri.UriSchemeHttp)
                {
                    req = (HttpWebRequest)HttpWebRequest.Create(fileURI);
                    reqState = new HttpWebRequestState(BUFFER_SIZE);
                    reqState.request = req;
                }
                else if (fileURI.Scheme == Uri.UriSchemeFtp)
                {
                    // Get credentials
                    GetCredentialsForm frmCreds = new GetCredentialsForm();
                    DialogResult result = frmCreds.ShowDialog();
                    if (result == DialogResult.OK)
                    {
                        req = (FtpWebRequest)FtpWebRequest.Create(fileURI);
                        req.Credentials = new NetworkCredential(frmCreds.Username, frmCreds.Password);
                        reqState = new FtpWebRequestState(BUFFER_SIZE);

                        // Set FTP-specific stuff
                        ((FtpWebRequest)req).KeepAlive = false;

                        // First thing we do is get file size.  2nd step, done later,
                        // will be to download actual file.
                        ((FtpWebRequest)req).Method = WebRequestMethods.Ftp.GetFileSize;
                        reqState.FTPMethod = WebRequestMethods.Ftp.GetFileSize;

                        reqState.request = req;
                    }
                    else
                        req = null;	// abort

                }
                else
                    MessageBox.Show("URL must be either http://xxx or ftp://xxxx");

                if (req != null)
                {
                    reqState.fileURI = fileURI;
                    reqState.respInfoCB = new ResponseInfoDelegate(SetResponseInfo);
                    reqState.progCB = new ProgressDelegate(Progress);
                    reqState.doneCB = new DoneDelegate(Done);
                    reqState.transferStart = DateTime.Now;

                    // Start the asynchronous request.
                    IAsyncResult result =
                      (IAsyncResult)req.BeginGetResponse(new AsyncCallback(RespCallback), reqState);
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show(string.Format("EXC in btnGetFile_Click(): {0}", ex.Message));
            }
        }

The basic goal here is to create an instance of either the HttpWebRequest or FtpWebRequest class.  This is done by calling the corresponding Create method and passing it the URI that the user entered.  Note that we use the Uri class to figure out whether the user entered an HTTP or an FTP URI.  We store the new request in a variable typed as the base class, WebRequest, which is what we’ll use to kick everything off.

We also create an instance of a class used to store some state information, either HttpWebRequestState or FtpWebRequestState.  These classes both derive from WebRequestState and are defined in this project, in WebRequestState.cs.

The idea of this state object is that we’ll hand it off to the asynchronous method that we use to do the actual download.  It then will get passed back to the callback that fires when an asynchronous method completes.  Think of it as a little suitcase of stuff that we want to carry around with us and hand off between the asynchronous methods.

Notice that if we’re doing an FTP transfer, we first pop up the credentials dialog to get the Username and Password from the user.  We then store those credentials in the FtpWebRequest object.

There’s one other difference between HTTP and FTP.  In the case of HTTP, we’ll fire off a single web request, with a GET command, to download the file.  But for FTP, we actually first send a command to read the file size, followed by the command to actually download the file.  To accomplish this, we set the Method property of the FtpWebRequest to WebRequestMethods.Ftp.GetFileSize.  We don’t set this property for HttpWebRequest because it just defaults to the GET command, which is what we want.

Towards the end of this function, you’ll see that I’m loading up the suitcase—setting the various properties of the WebRequestState object.  Along with the URI, we set up some delegates to point back to three callbacks in the DownloadStressTestForm class—SetResponseInfo, Progress, and Done.  These are the callbacks that actually update our user interface—when things start, during the file transfer, and when the file has finished downloading.

Finally, we call the BeginGetResponse method to actually launch the download.  Here, we specify a response callback—the method that will get called, not when the download has completed, but just when we get the actual HTTP response, or when the FTP command completes.  In the case of HTTP, we first get the response packet and then start reading the actual file using a stream that we get from the response.

What’s important here is that we push the work into the background, on another thread, as soon as possible.  We don’t do much work in the button click event handler before calling BeginGetResponse.  And this method is asynchronous—so we return control to the GUI immediately.  From this point on, we will only update the GUI in response to a callback.

Callbacks to Update the GUI

I mentioned the three callbacks above that we use to update the user interface—SetResponseInfo, Progress, and Done.  Here’s the declaration of the delegate types:

    public delegate void ResponseInfoDelegate(string statusDescr, string contentLength);
    public delegate void ProgressDelegate(int totalBytes, double pctComplete, double transferRate);
    public delegate void DoneDelegate();

And here are the bodies of each of these three callbacks, as implemented in DownloadStressTestForm.

        /// <summary>
        /// Response info callback, called after we get HTTP response header.
        /// Used to set info in GUI about max download size.
        /// </summary>
        private void SetResponseInfo(string statusDescr, string contentLength)
        {
            if (this.InvokeRequired)
            {
                ResponseInfoDelegate del = new ResponseInfoDelegate(SetResponseInfo);
                this.Invoke(del, new object[] { statusDescr, contentLength });
            }
            else
            {
                lblStatusDescr.Text = statusDescr;
                lblContentLength.Text = contentLength;
            }
        }

        /// <summary>
        /// Progress callback, called when we've read another packet of data.
        /// Used to set info in GUI on % complete & transfer rate.
        /// </summary>
        private void Progress(int totalBytes, double pctComplete, double transferRate)
        {
            if (this.InvokeRequired)
            {
                ProgressDelegate del = new ProgressDelegate(Progress);
                this.Invoke(del, new object[] { totalBytes, pctComplete, transferRate });
            }
            else
            {
                lblBytesRead.Text = totalBytes.ToString();
                progressBar1.Value = (int)pctComplete;
                lblRate.Text = transferRate.ToString("f0");
            }
        }

        /// <summary>
        /// GUI-updating callback called when download has completed.
        /// </summary>
        private void Done()
        {
            if (this.InvokeRequired)
            {
                DoneDelegate del = new DoneDelegate(Done);
                this.Invoke(del, new object[] { });
            }
            else
            {
                progressBar1.Value = 0;
                lblDownloadComplete.Visible = true;
            }
        }

This is pretty simple stuff.  To start with, notice the common pattern in each method, where we check InvokeRequired.  Remember the primary rule about updating controls in a user interface and asynchronous programming: the controls must be updated by the same thread that created them.  InvokeRequired tells us if we’re on the right thread or not.  If not, we use the Invoke method to recursively call ourselves, but on the thread that created the control (the one that owns the window handle).

Make note of this InvokeRequired / Invoke pattern.  You’ll use it whenever you’re doing background work on another thread and then you want to return some information back to the GUI.

The work that these callbacks do is very simple.  SetResponseInfo is called when we first get the response packet, as we start downloading the file.  We get an indication of the file size, which we write to the GUI.  Progress is called for each packet that we download.  We update the labels that indicate # bytes received and average transfer rate, as well as the main progress bar.  Done is called when we’re all done transferring the file.

The Response Callback

Let’s go back to where we called the WebRequest.BeginGetResponse method.  When we called this method, we specified our RespCallback as the method to get invoked when the response packet was received.  Here’s the code:

        /// <summary>
        /// Main response callback, invoked once we have first Response packet from
        /// server.  This is where we initiate the actual file transfer, reading from
        /// a stream.
        /// </summary>
        private static void RespCallback(IAsyncResult asyncResult)
        {
            try
            {
                // Will be either HttpWebRequestState or FtpWebRequestState
                WebRequestState reqState = ((WebRequestState)(asyncResult.AsyncState));
                WebRequest req = reqState.request;
                string statusDescr = "";
                string contentLength = "";

                // HTTP
                if (reqState.fileURI.Scheme == Uri.UriSchemeHttp)
                {
                    HttpWebResponse resp = ((HttpWebResponse)(req.EndGetResponse(asyncResult)));
                    reqState.response = resp;
                    statusDescr = resp.StatusDescription;
                    reqState.totalBytes = reqState.response.ContentLength;
                    contentLength = reqState.response.ContentLength.ToString();   // # bytes
                }

                // FTP part 1 - response to GetFileSize command
                else if ((reqState.fileURI.Scheme == Uri.UriSchemeFtp) &&
                         (reqState.FTPMethod == WebRequestMethods.Ftp.GetFileSize))
                {
                    // First FTP command was GetFileSize, so this 1st response is the size of
                    // the file.
                    FtpWebResponse resp = ((FtpWebResponse)(req.EndGetResponse(asyncResult)));
                    statusDescr = resp.StatusDescription;
                    reqState.totalBytes = resp.ContentLength;
                    contentLength = resp.ContentLength.ToString();   // # bytes
                }

                // FTP part 2 - response to DownloadFile command
                else if ((reqState.fileURI.Scheme == Uri.UriSchemeFtp) &&
                         (reqState.FTPMethod == WebRequestMethods.Ftp.DownloadFile))
                {
                    FtpWebResponse resp = ((FtpWebResponse)(req.EndGetResponse(asyncResult)));
                    reqState.response = resp;
                }

                else
                    throw new ApplicationException("Unexpected URI");

                // Get this info back to the GUI -- max # bytes, so we can do progress bar
                if (statusDescr != "")
                    reqState.respInfoCB(statusDescr, contentLength);

                // FTP part 1 done, need to kick off 2nd FTP request to get the actual file
                if ((reqState.fileURI.Scheme == Uri.UriSchemeFtp) && (reqState.FTPMethod == WebRequestMethods.Ftp.GetFileSize))
                {
                    // Note: Need to create a new FtpWebRequest, because we're not allowed to change .Method after
                    // we've already submitted the earlier request.  I.e. FtpWebRequest not recyclable.
                    // So create a new request, moving everything we need over to it.
                    FtpWebRequest req2 = (FtpWebRequest)FtpWebRequest.Create(reqState.fileURI);
                    req2.Credentials = req.Credentials;
                    req2.UseBinary = true;
                    req2.KeepAlive = true;
                    req2.Method = WebRequestMethods.Ftp.DownloadFile;

                    reqState.request = req2;
                    reqState.FTPMethod = WebRequestMethods.Ftp.DownloadFile;

                    // Start the asynchronous request, which will call back into this same method
                    IAsyncResult result =
                      (IAsyncResult)req2.BeginGetResponse(new AsyncCallback(RespCallback), reqState);
                }
                else    // HTTP or FTP part 2 -- we're ready for the actual file download
                {
                    // Set up a stream, for reading response data into it
                    Stream responseStream = reqState.response.GetResponseStream();
                    reqState.streamResponse = responseStream;

                    // Begin reading contents of the response data
                    IAsyncResult ar = responseStream.BeginRead(reqState.bufferRead, 0, BUFFER_SIZE, new AsyncCallback(ReadCallback), reqState);
                }

                return;
            }
            catch (Exception ex)
            {
                MessageBox.Show(string.Format("EXC in RespCallback(): {0}", ex.Message));
            }
        }

The first thing that we do in this method is to open our suitcase–our WebRequestState object, which comes back in the AsyncState property of the IAsyncResult.

The other main thing that we do in this method is to get the actual WebResponse object.  This contains the information that we actually got back from the server.  We do this by calling the EndGetResponse method.

Notice the standard Begin/End pattern for asynchronous programming here.  We could have done all of this synchronously, by calling GetResponse on the original HttpWebRequest (or FtpWebRequest object).  GetResponse would have returned an HttpWebResponse (or FtpWebResponse object).  Instead, we call BeginGetResponse to launch the asynchronous method and then call EndGetResponse in the callback to get the actual result—the WebResponse object.
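
Just to make the contrast concrete, here’s a rough sketch of what the synchronous version of this first step might look like.  This isn’t part of the sample project; the method name is made up, and it assumes the usual using System.Net directive.

        // Hypothetical synchronous equivalent (for comparison only, not in the sample).
        // GetResponse() blocks the calling thread until the response headers arrive,
        // so calling this from the button click handler would freeze the GUI.
        private static long GetContentLengthSync(Uri fileURI)
        {
            HttpWebRequest req = (HttpWebRequest)WebRequest.Create(fileURI);
            using (HttpWebResponse resp = (HttpWebResponse)req.GetResponse())
            {
                return resp.ContentLength;   // same value that we read in RespCallback
            }
        }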

At this point, the first thing that we want from the response packet is an indication of the length of the file that we’re downloading.  We get that from the ContentLength property.

It’s also at this point that we call the ResponseInfo delegate, passing it the status string and content length, to update the GUI.  (Using the respInfoCB field in the WebRequestState object).

Let’s ignore FTP for the moment and look at the final main thing that we do in this method—get a stream object and kick off a read of the first packet.  We get the stream from that WebResponse object and then go asynchronous again by calling the BeginRead method.  Are you seeing a pattern yet?  Again, if we wanted to do everything synchronously, we could just set up a loop here and call the stream’s Read method to read each buffer of data.  But instead, we fire up an asynchronous read, specifying our method that should be called when we get the first packet/buffer of data—ReadCallback.
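
And for completeness, the fully synchronous read loop that we’re avoiding would look something like this.  Again, this is just a sketch for comparison; it isn’t in the sample project, and it leans on the same BUFFER_SIZE constant from the form class.

        // Hypothetical synchronous read loop (for comparison only, not in the sample).
        // Reads the response stream to the end, one buffer at a time, blocking the
        // calling thread for the duration of the download.
        private static long ReadToEndSync(WebResponse resp)
        {
            byte[] buffer = new byte[BUFFER_SIZE];
            long totalBytes = 0;

            using (Stream responseStream = resp.GetResponseStream())
            {
                int bytesRead;
                while ((bytesRead = responseStream.Read(buffer, 0, buffer.Length)) > 0)
                {
                    totalBytes += bytesRead;    // this is where we would report progress
                }
            }

            return totalBytes;
        }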

FTP Download, Step 2

Let’s go back to how we’re doing FTP.  Remember that we set the FtpWebRequest.Method property to GetFileSize.  And in RespCallback, if we see that we just did that first command, we send the file size back to the GUI.  And then we’re ready to launch the 2nd FTP command, which is DownloadFile.  We do this by creating a 2nd FtpWebRequest and calling the BeginGetResponse method again.  And once again, when the asynchronous method completes, we’ll get control back in RespCallback.  We don’t risk recursing indefinitely because we store an indication of which command we’re doing in our suitcase—in WebRequestState.FTPMethod.

Gettin’ the Data

Finally, let’s take a look at the code where we actually get a chunk of data from the server.  First, a quick note about buffer size.  Notice that when I called BeginRead, I specified a buffer size using the BUFFER_SIZE constant.  For the record, I’m using a value of 1448 here, which is based on the size of a typical TCP packet (packet size less some header info).  We could really use any value here that we liked—it just seemed reasonable to ask for the data a packet at a time.
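
The constant itself doesn’t appear in the snippets in this post; in the form class it’s presumably declared along these lines:

        // Buffer size for each read, roughly one TCP payload's worth of data
        // (1500-byte Ethernet MTU, less IP and TCP header overhead).
        private const int BUFFER_SIZE = 1448;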

Here’s the code for our read callback, which first fires when the first packet is received, after calling BeginRead.

        /// <summary>
        /// Main callback invoked in response to the Stream.BeginRead method, when we have some data.
        /// </summary>
        private static void ReadCallback(IAsyncResult asyncResult)
        {
            try
            {
                // Will be either HttpWebRequestState or FtpWebRequestState
                WebRequestState reqState = ((WebRequestState)(asyncResult.AsyncState));

                Stream responseStream = reqState.streamResponse;

                // Get results of read operation
                int bytesRead = responseStream.EndRead(asyncResult);

                // Got some data, need to read more
                if (bytesRead > 0)
                {
                    // Report some progress, including total # bytes read, % complete, and transfer rate
                    reqState.bytesRead += bytesRead;
                    double pctComplete = ((double)reqState.bytesRead / (double)reqState.totalBytes) * 100.0f;

                    // Note: bytesRead/totalMS is in bytes/ms.  Convert to kb/sec.
                    TimeSpan totalTime = DateTime.Now - reqState.transferStart;
                    double kbPerSec = (reqState.bytesRead * 1000.0f) / (totalTime.TotalMilliseconds * 1024.0f);

                    reqState.progCB(reqState.bytesRead, pctComplete, kbPerSec);

                    // Kick off another read
                    IAsyncResult ar = responseStream.BeginRead(reqState.bufferRead, 0, BUFFER_SIZE, new AsyncCallback(ReadCallback), reqState);
                    return;
                }

                // EndRead returned 0, so no more data to be read
                else
                {
                    responseStream.Close();
                    reqState.response.Close();
                    reqState.doneCB();
                }
            }
            catch (Exception ex)
            {
                MessageBox.Show(string.Format("EXC in ReadCallback(): {0}", ex.Message));
            }
        }

As I’m so fond of saying, this is pretty simple stuff.  Once again, we make use of recursion, because we’re asynchronously reading a packet at a time.  We get the stream object out of our suitcase and then call EndRead to get an indication of how many bytes were read.  This is the indicator that will tell us when we’re done reading the data—in which case # bytes read will be 0.

If we’re all done reading the data, we close down our stream and WebResponse object, before calling our final GUI callback to tell the GUI that we’re done.

But if we did read some data, we first call our progress callback to tell the GUI that we got another packet and then we fire off another BeginRead.  (Which will, of course, lead to our landing back in the ReadCallback method when the next packet completes).

You can see that we’re passing back some basic info to the GUI—total # bytes read, the % complete, and the calculated average transfer rate, in KB per second.

If we actually cared about the data itself, we could find it in our suitcase—in WebRequestState.bufferRead.  This is just a byte array that we specified when we called BeginRead.  In the case of this application, we don’t care about the actual data, so we don’t do anything with it.
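
If we ever do want to keep the file, the change would be small.  Here’s a rough sketch (not in the sample project) of what we might add inside ReadCallback, assuming we had opened a FileStream when the transfer started and stashed it in the suitcase as a hypothetical fileStream field:

                // Hypothetical addition to ReadCallback, if we wanted to save the data.
                // Assumes a FileStream was opened at the start of the transfer and
                // carried along in the WebRequestState object as a fileStream field.
                if (bytesRead > 0)
                {
                    reqState.fileStream.Write(reqState.bufferRead, 0, bytesRead);
                }
                else
                {
                    reqState.fileStream.Close();    // flush and close once the download completes
                }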

Opening the Suitcase

We’ve looked at basically all the code, except for the implementation of the WebRequestState class that we’ve been using as our “suitcase”.  Here’s the base class:

    /// <summary>
    /// Base class for state object that gets passed around amongst async methods
    /// when doing async web request/response for data transfer.  We store basic
    /// things that track current state of a download, including # bytes transferred,
    /// as well as some async callbacks that will get invoked at various points.
    /// </summary>
    abstract public class WebRequestState
    {
        public int bytesRead;           // # bytes read during current transfer
        public long totalBytes;            // Total bytes to read
        public double progIncrement;    // delta % for each buffer read
        public Stream streamResponse;    // Stream to read from
        public byte[] bufferRead;        // Buffer to read data into
        public Uri fileURI;                // Uri of object being downloaded
        public string FTPMethod;        // What was the previous FTP command?  (e.g. get file size vs. download)
        public DateTime transferStart;  // Used for tracking xfr rate

        // Callbacks for response packet info & progress
        public ResponseInfoDelegate respInfoCB;
        public ProgressDelegate progCB;
        public DoneDelegate doneCB;

        private WebRequest _request;
        public virtual WebRequest request
        {
            get { return null; }
            set { _request = value; }
        }

        private WebResponse _response;
        public virtual WebResponse response
        {
            get { return null; }
            set { _response = value; }
        }

        public WebRequestState(int buffSize)
        {
            bytesRead = 0;
            bufferRead = new byte[buffSize];
            streamResponse = null;
        }
    }

This is just all of the stuff that we wanted to pass around between our asynchronous methods.  You’ll see our three delegates, for calling back to the GUI.  And you’ll also see where we store our WebRequest and WebResponse objects.

The final thing to look at is the code, also in WebRequestState.cs, for the two derived classes—HttpWebRequestState and FtpWebRequestState.

    /// <summary>
    /// State object for HTTP transfers
    /// </summary>
    public class HttpWebRequestState : WebRequestState
    {
        private HttpWebRequest _request;
        public override WebRequest request
        {
            get
            {
                return _request;
            }
            set
            {
                _request = (HttpWebRequest)value;
            }
        }

        private HttpWebResponse _response;
        public override WebResponse response
        {
            get
            {
                return _response;
            }
            set
            {
                _response = (HttpWebResponse)value;
            }
        }

        public HttpWebRequestState(int buffSize) : base(buffSize) { }
    }

    /// <summary>
    /// State object for FTP transfers
    /// </summary>
    public class FtpWebRequestState : WebRequestState
    {
        private FtpWebRequest _request;
        public override WebRequest request
        {
            get
            {
                return _request;
            }
            set
            {
                _request = (FtpWebRequest)value;
            }
        }

        private FtpWebResponse _response;
        public override WebResponse response
        {
            get
            {
                return _response;
            }
            set
            {
                _response = (FtpWebResponse)value;
            }
        }

        public FtpWebRequestState(int buffSize) : base(buffSize) { }
    }

The whole point of these classes is to allow us to override the request and response properties in the base class with strongly typed instances—e.g. HttpWebRequest and HttpWebResponse.

Wrapping Up

That’s about it—that’s really all that’s required to implement a very simple HTTP or FTP client application, using the HttpWebRequest and FtpWebRequest classes in System.Net.

This is still a pretty crude application and there are a number of obvious next steps that we could take if we wanted to improve it:

  • Allow user to pick # downloads and kick off simultaneous downloads, each with their own progress bar
  • Prevent clicking the Get File button if a download is already in progress (see the sketch after this list).  (Try it—you do actually get a 2nd download, but the progress bar goes whacky trying to report on both at the same time).
  • Add a timer so that we can recover if a transfer times out
  • Allow the user to actually store the data to a local file
  • Log the results somewhere, especially if we launched multiple downloads
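
As an example of how small some of these changes would be, the second item could probably be handled just by toggling the button in the handlers we’ve already seen, along the lines of the sketch below.  (This isn’t in the downloadable solution, and I’m assuming the button is named btnGetFile, based on the btnGetFile_Click handler.)

        // Sketch only: disable the button while a download is in flight.
        // In btnGetFile_Click, just before calling BeginGetResponse:
        btnGetFile.Enabled = false;

        // ...and in the GUI-thread branch of the Done() callback:
        progressBar1.Value = 0;
        lblDownloadComplete.Visible = true;
        btnGetFile.Enabled = true;      // ready for the next download

To be thorough, we’d also want to re-enable the button in the exception handlers, so that a failed download doesn’t leave it stuck in the disabled state.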

Why You Need a Backup Plan

Everyone has a backup plan.  Whether you have one that you follow carefully or whether you’ve never even thought about backups, you have a plan in place.  Whatever you are doing or not doing constitutes your backup plan.

I would propose that the three most common backup plans that people follow are:

  1. Remain completely ignorant of the need to back up files
  2. Vaguely know that you should back up your PC, but not really understand what this means
  3. Fully realize the dangers of going without backups and do occasional manual backups, but procrastinate coming up with a plan to do it regularly

Plan #1 is most commonly practiced by less technical folk—i.e. your parents, your brother-in-law, or your local pizza place.  These people can hardly be faulted.  The computer has always remembered everything that they’ve told it, so how could it actually lose something?  (Your pizza guy was unpleasantly reminded of this when his browser informed his wife that the “Tomato Sauce Babes” site was one of his favorite sites).  When these people lose something, they become angry and will likely never trust computers again.

Plan #2 is followed by people who used to follow plan #1, but graduated to plan #2 after accidentally deleting an important file and then blindly trying various things they didn’t understand—including emptying their Recycle Bin.  They now understand that bad things can happen.  (You can also qualify for advancement from plan #1 to #2 if you’ve ever done the following—spent hours editing a document, closed it without first saving, and then clicked No when asked “Do you want to save changes to your document?”)  Although this group understands the dangers of losing stuff, they don’t really know what they can do to protect their data.

Plan #3 is what most of us techies have used for many years.  We do occasional full backups of our system and we may even configure a backup tool to do regular automated backups to a network drive.  But we quickly become complacent and forget to check to see if the backups are still getting done.  Or we forget to add newly created directories to our backup configuration.  How many of us are confident that we have regular backups occurring until the day that we need to restore a file and discover nothing but a one line .log file in our backup directory that simply says “directory not found”?

Shame on us.  If we’ve been working in software development or IT for any length of time, bad things definitely have happened to us.  So we should know better.

Here’s a little test.  When you’re working in Microsoft Word, how often do you press Ctrl-S?  Only after you’ve been slaving away for two hours, writing the killer memo?  Or do you save after every paragraph (or sentence)?  Most of us have suffered one of those “holy f**k” moments at some point in our career.  And now we do know better.

How to Lose Your Data

There are lots of different ways to lose data.  Most of us know to “save early and often” when working on a document because we know that we can’t back up what’s not even on the disk.  But when it comes to actual disk crashes (or worse), we become complacent.  This is certainly true for me.  I had a hard disk crash in 1997 and lost some things that were important to me.  For the next few months, I did regular backups like some sort of data protection zealot.  But I haven’t had a true crash since then—and my backup habits have gradually deteriorated, as I slowly regained my confidence in the reliability of my hard drives.

After all, I’ve read that typical hard drives have an MTBF (Mean Time Between Failures) of 1,000,000 hours.  That works out to 114 years, so I should be okay, right?

No.  MTBF numbers for drives don’t mean that your hard drive is guaranteed (or even expected) to run for many years before encountering an error.  Your MTBF number might be 30 years, but if the service life of your drive is only five years, then you can expect failures to become more frequent after those five years.  A 30-year MTBF really means that, statistically, if you ran six drives for that five-year service life, you’d expect one of them to fail by the end of the period—one failure per 30 drive-years, spread across all six drives.  And if you were running 30 drives at the same time, you’d expect the first failure on one of those drives within the first year.
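
The back-of-the-envelope arithmetic behind those numbers, written out as a rough formula (my own framing, not anything from the drive manufacturers):

    1{,}000{,}000 \text{ hours} \div (24 \times 365) \approx 114 \text{ years}

    \text{expected failures} \approx \frac{N_{\text{drives}} \times T_{\text{years}}}{\text{MTBF (years)}}, \qquad \frac{6 \times 5}{30} = 1, \qquad \frac{30 \times 1}{30} = 1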

In point of fact, your drive might fail the first year.  Or the first day.

And hard drive crashes aren’t the only, or even the most common, type of data loss.  A recent PC World story refers to a study saying that over 300,000 laptops are lost each year from major U.S. airports and not reclaimed.  What about power outages?  Applications that crash and corrupt the file that they were working with?  (Excel did this to me once).  Flood/fire/earthquake?  Or just plain stupidity?  (Delete is right next to Rename in the Windows Explorer context menu).

A Good Backup Plan

So we’re back to where we started.  You definitely need a backup plan.  And you need something better than the default plans listed above.

You need a backup plan that:

  • Runs automatically, without your having to remember to do something
  • Runs often enough to protect data that changes frequently
  • Copies things not just off-disk, or off-computer, but off-site
  • Allows restoring lost data in a reasonably straightforward manner
  • Secures your data, as well as backing it up (when appropriate)
  • Allows access to old data even after you’ve intentionally deleted it from your PC
  • Refreshes backed-up data regularly, or stores the data on media that will last a long time

The most important attribute of a good backup plan, by far, is that it is automated.  When I was in college, I used to do weekly backups of my entire PC to a stack of floppies, and then haul the floppies to my parents’ house when I’d visit on Sunday.  But when the last few weeks of the semester rolled around, I was typically so busy with papers and cramming that I didn’t have time to babysit a stack of floppies while doing backups.  So I’d skip doing them for a few weeks—at the same time that I was creating a lot of important new school-related data.

How often should your data get backed up?  The answer is–often enough that you never lose more work than you’re willing to redo.  Reentering a day’s worth of data into Quicken isn’t too painful.  But reentering a full month’s worth probably is—so nightly backups make sense if you use Quicken every day.  On the other hand, when I’m working on some important document that I’ve spent hours editing, I typically back the file up several times an hour.  Losing 10-15 minutes’ worth of work is my pain point.

Off-site backups are important, but often overlooked.  The more destructive the type of data loss, the farther away from the original the backup should be, to keep it safe.  For an accidental fat-finger deletion, a copy in a different directory is sufficient.  Hard drive crash?  The file should be on a different drive.  PC hit by a voltage spike?  The file should be on a different machine.  Fire or flood?  You’d better have a copy at another location if you want to be able to restore it.  The exercise is this—imagine all the bad things that might happen to your data and then decide where to put the data to keep it safe.  If you live in San Francisco and you’re planning for the Big One of ’09, then don’t just store your backups at a buddy’s house down the street.  Send the data to a family member in Chicago.

If you do lose data, you ought to be able to quickly: a) find the data that you lost and b) get that data back again.  If you do full backups once a year to some arcane tape format and then do daily incremental backups, also to tape, how long will it take you to find and restore a clean copy of a single corrupted file?  How long will it take you to completely restore an entire drive that went bad?  Pay attention to the format of your backups and the processes and tools needed to get at your archives.  It should be very easy to find and restore something when you need it.

How concerned are you with the idea of someone else gaining access to your data?  When it comes to privacy, all data is not created equal.  You likely wouldn’t care much if someone got a hold of your Mario Kart high scores.  (In fact, some of you are apparently geeky enough to have already published them).  On the other hand, you wouldn’t be too happy if someone got a copy of that text file where you store your credit card numbers and bank passwords.  No matter how much you trust the tool vendor or service that you’re using for backups, you ought to encrypt any data that you wouldn’t want handed out at a local biker bar.  Actually, this data should already be encrypted on your PC anyway—no matter how physically secure you think your PC is.

We might be tempted to think that the ideal backup plan would be to somehow have all of your data continuously replicated on a system located somewhere else.  Whenever you create or change a file, the changes would be instantly replicated on the other system.  Now you have a perfect replica of all your work, at another location, all of the time.  The problem with this approach is that if you delete a file or directory and then later decide that you wanted it back, it’s too late.  The file will have already been deleted from your backup server.  So, while mirroring data is a good strategy in some cases, you should also have a way to take snapshots of your data and then to leave the snapshots untouched.  (Take a look at the Wayback Machine at the Internet Archive for an example of data archival).

On the other hand, you don’t want to just archive data off to some medium and then never touch it again, expecting the media to last forever.  If you moved precious family photos off of your hard disk and burned them to CDs, do you expect the data on the CDs to be there forever?  Are you figuring that you’ll pass the stack of CDs on to your kids?  A lot has been written about media longevity, but I’ve read that cheaply burned CDs and DVDs may last no longer than 12-24 months.  You need a plan that re-archives your data periodically, to new media or even new types of media.  And ideally, you are archiving multiple copies of everything to protect against problems with the media itself.

How Important Is This?

The critical question to ask yourself is–how precious is my data to me?  Your answer will guide you in coming up with a backup plan that is as failsafe as you need it to be.  Your most important data deserves to be obsessed over.  You probably have thousands of family photos that exist only digitally.  They should be backed up often, in multiple formats, to multiple locations.  One of the best ways to protect data from loss is to disseminate it as widely as possible.  So maybe in addition to multiple backups, your best bet is to print physical copies of these photos and send boxes of photos to family members in several different states.

The bottom line is that you need a backup plan that you’ve come up with deliberately and one that you are following all of the time.  Your data is too important to trust to chance, or to a plan that depends on your remembering to do backups from time to time.  A deliberate plan, coupled with a healthy amount of paranoia, is the best way to keep your data safe.

Next Time

In my next post, I’ll put together a list of various products and services that can help you with backups.  And I’ll share my own backup plan (imperfect as it is).

Confessions of a Podcastaholic

I’ve been an iPod user for about a year and a half now.  I’m an obsessive music lover and collector, but I waited quite a while before I bought my first MP3 player.  My rationale was that I didn’t just want something that would let me carry around an album or two.  If that was the case, I’d be constantly moving music onto the player and off again.  Instead, I wanted to wait until I could buy something that could store my entire music collection–or at least enough of it that I’d be able to carry around a good percentage of my collection with me.

I’d been digitizing my CDs for years and enjoying listening to them on my PC, working my way through various player software.  Pre-iPod, I’d eventually settled on the RealPlayer application for managing all my music.  But it never crossed the line to become a truly great application for me.  I liked the idea of being able to organize everything into multiple playlists and then play through a playlist on shuffle mode.  But the biggest pain point was still that my music was stuck in one physical location–on one physical PC.

Like a lot of software developers, I like to occasionally wear headphones while at work.  Working in cubeland, this is often necessary, given the noise and distractions.  I hauled an old laptop into work at one point, after copying much of my music collection to it.  So I now had my music in two different places–on my home PC and at work.  I could also now make the statement that I had an MP3 player of sorts, albeit a 6 pound one that took a few minutes to boot up.

At some point it dawned on me that I should just ditch my old laptop and upgrade to one of the latest and greatest iPods.  At the time, the 80GB video model was the largest one available.  I plunked down my money and after a short wait got my first ever Apple product.

It was incredible.  As expected, it just worked.  It took some time, but I gradually started moving my music collection onto the iPod.  I had plenty of room with 80GB, and I was pleased that I’d finally found what I wanted–a single device where I could store my entire music collection.  It also truly amazed me when I realized one day that my “little music player” had a hard drive twice as large as my Windows development laptop at work.  Ok, granted, my company tends to cheap out on PCs.  But still–here I am a well-paid software developer and I have 2x the space on a deck-of-cards device as I do on my development machine.  Wow.

This is the point where my life really began to change.  Having my entire CD collection, going back 25 years, in my pocket was truly astounding.  But I quickly discovered the true killer app of the iPod–podcasts.  Before I bought the iPod, I had some vague notion that there were podcasts out there and I understood the basic concept.  But I’d not planned on listening to podcasts at all–I’d bought the iPod solely as a music device.

The podcast habit started when, out of curiosity, I began to listen to some of the more popular tech/software podcasts–This Week in Tech and .NET Rocks.  I quickly added daily news, more technical stuff, and a bunch of family history related podcasts.  I just couldn’t get enough–I became a complete podcastaholic.  Just one month into my iPod experience, I was listening to podcasts during my commute, while at work, and late into the evening.  I was hooked.  At some point, I realized that it had become rare for me to listen to music anymore.  I was using the iPod exclusively to listen to podcasts.

Podcasts became a huge hit for me for two reasons.  For starters, it was just so darn easy to get the content onto the device.  I left iTunes running constantly on my PC at home and plugged the iPod in every night, which meant that I’d automatically get all the latest episodes of everything the next morning.  Better yet, Mr. Jobs was clever enough to remove the podcasts that I’d already listened to.  Nothing could be easier.

The second biggie for me was just the excellent content that was available.  It was reminiscent of hunting for good programming on public radio, except that I had about a thousand times the number of programs to choose from.  So instead of getting Science Friday (fairly interesting, mildly relevant), I was now listening to .NET Rocks with Carl and Richard twice a week (very energizing and hugely relevant).  I was in absolute techie heaven!

The great thing is how dynamic the podcast universe is.  Podcasts are born and die all the time, with new content showing up almost daily.  I try to go back to iTunes every few weeks and just do some browsing.  And it seems like I always stumble on something new, interesting, and worth listening to.

Eighteen months into my podcast experience, I haven’t slowed down and I’m as much a podcastaholic as ever–even more so.  With plenty of house projects to work on and a huge lawn to mow, Carl and Richard now accompany me on the riding mower–along with Leo, Paul Thurrott, Robert Heron and even those wacky Digg guys from time to time.

Here’s my current podcast lineup.  These are the podcasts that I listen to fairly regularly and I can highly recommend everything on these lists.

Audio Podcasts

– A Prairie Home Companion’s News from Lake Wobegon – weekly, 15 mins – I’ve been listening to PHC since the early 80s and now I no longer miss the core Keillor experience (NFLW).
– Garrison Keillor’s The Writer’s Almanac – daily, 5 mins – Nice little bit of daily history (whose birthday is it today), along with a poem
– The ASP.NET Podcast by Wally McClure and Paul Glavich – every few days, variable – Wally is easy to listen to and you’ll get plenty of ASP.NET goodness
– Dear Myrtle’s Family History Hour – weekly, 1 hr – a bit too quaint for my tastes, but often some nice family history gems
– Entrepreneurial Thought Leaders – weekly (seasonal), 1 hr  – Excellent lecture series out of Stanford, wonderful speakers
– Front Page – daily, 5 mins – NY Times front page overview, good quick news hit
– Genealogy Gems – biweekly(?), 45 mins – Lisa Cooke’s excellent genealogy podcasts
– The Genealogy Guys Podcast – weekly, 1 hr – very solid genealogy stuff, weekly news & more
– Hanselminutes – weekly, 40 mins – One of my favorites, Scott Hanselman helps you grok the coolest new technologies
– History According to Bob – daily, 10-15 mins – Bob is a history professor and relentless podcaster.  Excellent stuff.
– .NET Rocks – 2/wk, 1 hr – Absolute must-listen for anyone doing .NET.  Great, great material.
– net@night – weekly, 1 hr – Leo Laporte and Amber MacArthur, with weekly web gossip.
– News.com daily podcast from CNET – daily, 10 mins – Good daily tech news overview
– NPR 7PM ET News Summary – daily, 5 mins – Another little daily news blurb.
– Polymorphic Podcast – sporadic, 45 mins – Craig Shoemaker, sometimes good stuff on patterns, bit spotty lately
– Roz Rows the Pacific – every 2 days, 25 mins – Roz Savage is podcasting 3 times/wk as she rows across the Pacific.
– Security Now – weekly, 1+ hrs – Steve Gibson on all things security.  Deeply technical and not to be missed.
– stackoverflow – Weekly, 1 hr – New podcast, with Joel Spolsky and Jeff Atwood.  Both very insightful on software dev topics
– This week in Tech – weekly, 1.5 hrs – Leo Laporte’s flagship podcast.  Can be more fluff than content, but a fun listen
– Windows Weekly – weekly, 1+ hrs – One of the highest quality podcasts available, Paul Thurrott w/excellent stuff on Windows

Video Podcasts

– Democracy Now! – daily, 1 hr – I don’t always have time for it, but Amy Goodman is true journalism, pure gold.
– Diggnation – weekly, 1 hr – absolute fluff, but sometimes fun to watch Kevin and Alex gossip
– dl.tv – weekly, 1/2 hr – Tied for 1st place w/Tekzilla as best techie show, great content
– Gametrailers.com – XBox 360 spotlight – daily (multiple), 2-3 mins – some great game trailers
– Geekbrief.TV – daily, 3-4 mins – Cali Lewis, quick recap of latest cool gadgets
– Mahalo Daily – daily, 5 mins – more entertainment than tech content, but sometimes some interesting stuff
– Tekzilla – daily, 1-2 mins (weekly, 40 mins) – Excellent techie show, with Patrick Norton & Veronica Belmont
– X-Play’s daily video podcast – daily, 2-3 mins – Video game reviews

Looking back, I realize that I did go through an evolution in how I listen to music when I bought the iPod.  Although I started out thinking that I just wanted a convenient way to listen to my CDs, I was of course moving from CDs to a world where all my music is digital, and stored as MP3s.  This is truly an evolutionary step, in the same way that moving from vinyl to CDs was, back in the 1980s.  I do still buy lots of CDs, but only because I object so strongly to DRM.  The moment I pull a new CD out of its shrink wrap, it gets ripped, stored and sprinkled into various playlists.  I now have dozens of CDs that I’ve purchased, but never actually listened to on a CD player.

But as life-changing as it’s been to evolve my music listening habits, the amazing thing is that CDs to MP3s was a subtle life shift, compared to how podcasts have changed things for me.  I have access to so much wonderful content and in such a convenient form factor.  And my kludgy setup–iTunes on the PC and nightly synchs–will likely soon be replaced by something much more convenient and seamless.

The great thing is that the podcast revolution is just getting started.  Or maybe we should call it the user-generated content revolution.  Media is beginning to change in ways that most people just can’t imagine.  We are just beginning to be able to watch and listen to exactly what we want, when we want and where we want.  Technology is bringing our media to us.  Not only do we no longer have to physically plop down in front of a television, the content that we can choose from goes far beyond the selection that we’ve gotten from satellite TV.  Even more amazing, the boundaries between media producer and media consumer are dissolving.  It’s nearly as easy for me to produce my own podcast as it is to subscribe to one.  That’s just incredible.  And it’s far, far easier for people who generate high-value content, like Leo Laporte, to get their content delivered to me.

It’s a wonderful world–and I plan on continuing to wear my podcastaholic badge proudly.

Why Blog?

Ok, here we go–first post in a brand new blog!  What to say?  Oh, the intimidation factor of the blank page…  Though I’ve been journaling compulsively for years and I’m a fairly avid writer, there is an odd feeling to the idea of a public blog.  I realize that blogs are primarily an ego-boosting activity for most people and likely to be unread by all but the author and maybe some family members.  So I’m under no illusions that my ramblings will be read by anyone but me.  But still–writing a blog feels like stepping up to a microphone in some huge auditorium.  Granted, the auditorium is empty at the moment, but it’s still a tiny bit intimidating.

When it comes to blogging, I’m still also suffering to some extent from what my wife and I refer to as a strong case of “disdainium”.  When I first heard about people blogging, or touting their blogs, I scoffed at what I felt was just a fancy new name for plain old ego-laden personal web pages.  Okay, the typography and layout were clearly superior, based on some very nice templates.  But that only served to make the trite content more painful to read because it looked so nice.

So what is the deal with blogs?  Are they really worth writing?  Worth reading?  The answer is, for some authors, absolutely.  Reading content written by the likes of Steve McConnell, Joel Spolsky, Jeff Atwood, or Mary Jo Foley is always time well spent.  These are people who would be writing good books or good technical articles if they weren’t blogging.  Actually, these are people who DO write good books and good technical articles.  A blog is nothing more than a low-energy mechanism for good authors like this to share miscellaneous thoughts with us.  And blogging lets us enjoy them in many tiny little doses, rather than waiting for months to read the next article or for years to digest the next big book.

So we should, indeed, thank the original authors of the blog tools that came up with the now familiar reverse chronological format.  Sure, our favorite bloggers could easily be publishing static web content every week, dishing up plain old HTML.  But the rise of blogging as a well-known cultural phenomenon just lowers the barriers to publishing one’s own content.  And so, for authors of engaging content, blogging is truly a no-brainer.

And what about the rest of us?  Just because we can say something, should we?  If the band leaves the microphone on while they go on break, should the rowdy at the front table walk up there and start babbling to the crowd?  Well, that depends on who the audience is.  At the bar, if the audience is mainly the rowdy’s frat buddies, he should absolutely step up–he’ll be well received.  At orchestra hall, or a coffee shop, well–probably not so much.

When it comes to blogging, the same holds true.  It all depends on the audience.  If my audience is my wife and my mother-in-law, then blogging about my daughter’s latest potty-training escapade will be much appreciated.  And I won’t truly be that heartbroken when Joel Spolsky doesn’t read my blog.  But the beauty of the web is that it’s truly democratic.  Though we basically have one microphone, we’re not all forced to sit in this room and listen to whoever is talking.  For the most part, natural selection will prevail and the people who are the most interested in what we have to say will end up reading what we write.

So that brings me to the critical first-post question–why am I blogging?  This is the same as asking–what do I want to say and who do I want to say it to?  The obvious follow-on question is then–is blogging the best way to communicate with this audience?

For me, it turns out that I’ve thought about blogging for a number of years and managed to resist the temptation until now.  I always ended up concluding that it would just be an ego-feeding activity with not much real value.  Although I love to write and I’ve always written for myself recreationally, I’ve never had much of a desire to write for other people.  And given the cacophony of blogging going on right now on the web, I honestly don’t feel that I have much to add that is of value–or that anyone else would care to read.  I’m also the kind of person who always has about a hundred personal projects that I’m working on concurrently, so I really don’t need to add what I figured was a low-value project like blogging to the list.

But in the end, I come back to the core idea of blogging–it really doesn’t matter if anyone else reads this stuff.  If I have some topic that I’ve invested enough thought into to write it down anyway, why not use blogging as a vehicle for my writing?  That way, I end up with all my ramblings and notes in a central place where I can get at them.  And if there’s someone else out there that gets a little bit of value from what I’m rambling on about, that’s fine too.

As it turns out, I’ve been journaling at a pretty good clip for years.  And, like most diarists, I guess I have a couple of audiences in mind as I write.  My first audience is myself.  Writing about daily life, experiences, or people I interact with is just a great way for me to process everything.

But as I write, I’m also always thinking about my second audience–posterity.  For me, this mainly means–my kids reading my journals at some distant point in the future, after I’ve gone.  I’m sure I’ll write a lot more on this in the future, but this ties in with my passion for family history and capturing personal and family experiences.  I would give anything to have more snippets of writing from my Dad or other family members who are no longer with us.

So that brings me back to blogging.  I’ll never publicly publish any of my journaling on the web.  That’s far too exhibitionist for my taste.  But aside from journaling, there are plenty of miscellaneous thoughts and ramblings that I wouldn’t mind capturing–for myself and in a format that I can pass on to my kids.  I’m under no illusions that I have anything all that mind-bending to say.  But it is what it is and a blog isn’t a half bad place to keep this stuff.  It’s a place that I can come back to myself, for reference.  And if someone else stumbles on my collection of mental trinkets and finds something of value in one of them, then it will have been well worth the effort.