John Halamka, who is in charge of network computing at both Harvard Medical School and Beth Israel Deaconess Medical Center, considers recent public cloud outages (from Amazon to Blogger) and says he remains optimistic about the basic concept, in part because:
Problems on centralized cloud architecture that is homogenous, well documented, and highly staffed will be more rapidly resolved than problems in distributed, poorly staffed one-off installations.
He describes some of the issues his own campus clouds have had over the past year, including:
HMS has clustered thousands of computing cores together to create a highly robust community resource connected to a petabyte of distributed storage nodes. In theory it should be invincible. In practice it went down. A user with limited high performance computing experience launched a poorly written job to 400 cores in parallel that caused a core dump every second contending for the same disk space. Storage was overwhelmed and went offline for numerous applications.