Over the past year I have been inundated with questions from a wide variety of researchers across a number of institutions (national and international) seeking guidance on how best to “relocate” or “retool” their research to use on-demand, utility computing services such as those offered by Amazon. Given the frequency and similarity of these inquiries, I decided to put together this FAQ, targeted at cloud newcomers who want straightforward information about the viability of on-demand computation tools and their appeal to funding agencies. Most of my experience has been with Amazon Web Services and, to a lesser extent, Google, although I feel that Amazon has a long head start in this space. This FAQ is aimed at biomedical researchers engaged in the management and analysis of genomic data, although most, if not all, of the answers apply to other research domains.
The Data Deluge:
Unless you have been in hiding, you are definitely aware that biomedical and bioinformatics-assisted projects are overrun with data as Next Generation Sequencing technology continues to offer better coverage and faster run times at decreasing cost. Consequently, we are running out of places to put this data, let alone analyze it. Being able to “park”, for example, twenty terabytes of data on a storage device is useful, though if that data is not also visible to an associated computational grid via a high-speed network and a well-performing file system, then analysis will be difficult. “Data has gravity”, which means that it has become easier to bring the computation to the data than the other way around. If your local environment has no difficulty providing this capability and is fully committed to helping you scale your research, then you probably don’t need cloud resources. However, it is important to develop a longer-range strategy, since institutions as a whole are looking at cloud computing to reduce costs, and ongoing local hardware acquisitions and associated data center upgrades are being reconsidered.
It’s all about me! What can the cloud do for me?
Cloud services provide on-demand, custom compute resources that can be configured and optimized for your unique workload. The same is true for storage, networking (to an extent), databases, and at-scale technologies such as Apache Spark. Cloud computing is offered on a “pay only for what you use” basis, so there is no subscription cost. The resources you “spin up” are yours alone, so these environments can be highly customized, with the ability to create templates and associated images to facilitate reproducible research and ease distribution of any software products you develop as part of your research. Another way to consider this is that anyone with a laptop, an Internet connection, and a credit card could set up and manage large-scale computational resources including databases, replicated storage, and fast networking. Of course, knowing how to accomplish these things is a prerequisite, so optimal use of the cloud becomes a function of one’s willingness to learn and/or engage knowledgeable collaborators.
Are you sure this isn’t just a fad?
Hardly. Independently of Research Computing interests, the use of cloud resources is immensely popular in many domains. Ever watch Netflix? That’s running off Amazon. Your banking solutions are cloud based (SaaS – Software as a Service), as is your email (MS Office 365), so don’t expect the trend to stop anytime soon. CIOs are under pressure to reduce costs associated with infrastructure and data centers, which means the only viable alternative solutions will be cloud-based. As it relates to computational research, there will always be a need for some type of local experimental activity (see other questions in this FAQ for more detail), so it’s not as if moving to the cloud requires a total evacuation of local resources – although some IT shops might like that, since it takes the pressure off of them to continue to provide local resources. On the other hand, the cloud generally provides far more flexibility in compute and storage services than most IT shops, so consider that the cloud isn’t something that happens to you – it happens for you. Also consider that there are a number of excellent large-scale computational resources, such as TACC, Oak Ridge, and the SouthDB hub, which offer a more specialized form of compute services.
What questions should I ask of my existing computational provider?
Chances are that your local provider of computational resources is aware of cloud and utility computing, so they might in fact have some ideas about what the future holds. However, you should be proactive and ask them what plans and timelines exist to migrate from local to cloud resources. In reality they should be pursuing a hybrid model, a mixture of local and cloud resources, where some workloads might be so large as to be feasible only in the cloud. A hybrid model allows users to continue within a familiar environment as they learn how to move their workloads into the cloud. A more important question is how your local facility intends to support users in the transition. This is critical to facilitate adoption of cloud resources and to make users productive.
Is using the cloud better than using my in-house computational cluster?
Are you happy with what you’ve got?
The answer is mostly a function of your satisfaction level with your current environment. But the two aren’t mutually exclusive. Many people use both cloud and local resources and, over time, are moving toward a “cloud preferred” model at their convenience. Relative to local resources, an important question is to what extent those resources will continue to be available, and at what cost. Also ask your local HPC provider to what extent they intend to expand using cloud resources. Hybrid environments can be powerful in that anything too large to run locally can be spun out into the cloud, although that linkage has to be set up for this to occur transparently. Many local providers (should) recognize the utility of the hybrid approach, as it allows them to expand into the cloud without disturbing local resources. It also provides opportunities for them to train personnel to fully exploit the at-scale technologies offered by the cloud. User support is a key concept, and if your projects remain heavily reliant upon support personnel then you will need to budget for similar assistance when moving to the cloud.
In the cloud it all belongs to you
One idea that cloud novices frequently miss is that when using on-demand compute resources, the resulting environment is yours and yours alone unless you wish to share it. That is, you do not need grid management software, as you can run jobs at will and without waiting. Moreover, you can start up multiple servers of arbitrary memory and storage size to arrive at the optimal configuration for your workload. To illustrate this point, I recently assisted a researcher in the creation of an Amazon x1.32xlarge instance (128 vCPUs and 1,952 GB of RAM). As his computational career to date had exclusively involved shared Beowulf-style clusters, his first question was, “So what is the average wait length of the job queue?” He simply didn’t grasp that the resource was for his use alone and that, instead of waiting six days for his queued jobs to be accepted and completed (his usual experience), he could be finished in approximately eight hours. Of course this comes at a cost ($13.33 per hour as of this writing), and if you are paying little or nothing for local resources then perhaps waiting six days as opposed to eight hours is acceptable. You’ll have to make that decision.
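To make the trade-off concrete, here is a back-of-envelope sketch in Python. The hourly rate is the one quoted above; the runtimes are the illustrative figures from this example, not guarantees:

```python
# Back-of-envelope comparison: shared-cluster queue wait vs. a dedicated
# on-demand instance. Rates and runtimes are illustrative, from the example above.
ON_DEMAND_RATE = 13.33        # USD per hour, x1.32xlarge as of this writing
CLOUD_HOURS = 8               # estimated runtime on the dedicated instance
CLUSTER_WAIT_HOURS = 6 * 24   # typical queue wait on the shared cluster

cloud_cost = ON_DEMAND_RATE * CLOUD_HOURS
hours_saved = CLUSTER_WAIT_HOURS - CLOUD_HOURS
print(f"Cloud cost: ${cloud_cost:.2f}, time saved: {hours_saved} hours")
```

Swap in your own rate and runtime estimates; the point is simply to put a dollar figure next to the days of queue wait before deciding.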
What workloads are NOT appropriate for the cloud?
An obvious case is any project involving protected health information or code associated with proprietary software development. While Amazon is able to accommodate these use cases, you should definitely ask someone locally to determine the relevant policies. Not all IT shops are supportive of, or enamored with, the idea of cloud usage in general, let alone when it intersects with sensitive data, so be prepared for some strongly worded responses. Again, protecting health information is essential, so ask around before moving data anywhere! This includes USB sticks, Dropbox accounts, wherever.
Another scenario wherein you might be better off running locally would be any service that can be hosted for free. After all, you aren’t paying anything, so unless performance or scalability issues exist, or the subsidies that keep the service free are in jeopardy of ending, then no one would blame you for using such resources in perpetuity. That said, it’s easy to experiment with low-cost Amazon instances configured with the LAMP stack to move, for example, a lab web site into the cloud if you wish to do more aggressive web development or make your site independent of an institution.
Also, if you have a collaborator who has access to computational hardware, or a relationship with a computing facility, at an attractive price, then obviously you would explore that option, especially if the collaborator is going to be taking the lead on analysis and data management.
Does the arrival of cloud computing imply that supercomputing entities are no longer viable?
Not at all. Dedicated high performance computing installations such as XSEDE, Open Science Grid, Oak Ridge, and TACC (to name a few) can and do provide excellent computing resources as well as expert user support. Depending on the nature of your research and your funding agency, you might receive allocations from one or more of these organizations. While many workloads can be “spun up” in the cloud on your own, it might be beneficial to first leverage these resources, especially if they bring specific expertise to bear on your research problem.
Does the NIH endorse Cloud Computing?
Consider reading the publication “The 25-Point Implementation Plan to Reform Federal Information Technology Management,” published in 2010, which discusses a shift to a “cloud first” approach. The idea here is that as far back as seven years ago the federal government, NIH included, recognized the viability of cloud computing and sought to document its intent to formally recognize it as a future direction. It is important to consider that access to cloud technology has had a democratizing effect, in that institutions and research laboratories with comparatively limited computational resources can now compete with better-endowed institutions by leveraging the enormous power of on-demand computing. The conclusion is that the NIH has been “on board” with cloud computing for years now.
Also check out the NIH Data Commons Pilot Project associated with the BD2K initiative, which uses “a cloud-based data commons model to find and interact with data directly in the cloud without spending time and resources downloading large datasets to their own servers”. The concept here, again, is that “data has gravity”: it is easier to bring the compute resources to where the data is stored than to drag data (possibly in the terabyte to petabyte range) to where the compute resources live.
In 2015 the NIH issued a statement outlining its position on storing controlled-access information in the cloud. This is less an actual endorsement than a recognition of the value and increased use of cloud resources, but since that time the use of Amazon and Google cloud resources within NIH-funded research has grown considerably. Check out the NCI Cancer Genomics Cloud project page for an idea of some of the pilot projects.
When applying for grants it is helpful to view computational cycles as “consumables,” much in the same way we view office supplies. The idea is that funding agencies might not be so intrigued by the underlying technical details of a computer’s chipset (unless of course that is part of the research), only that the price per cycle is competitive and that any awarded funding will be used optimally.
What is involved in moving to the cloud?
In the case wherein you do not have an existing workload, it’s as simple as signing up for an account and then launching compute instances. Of course, knowing how to do this, and doing it in a productive way, requires knowledge, so if you aren’t up to speed on these tools then you will need assistance. But this is no different than if you were to move to an institution and needed help acclimating to the resident computational resource. Some places offer lots of user support whereas some offer almost none. It really depends on the environment.
In the case wherein you have existing workloads, you will do what is known as a “lift and shift,” wherein you reproduce your local computational environment in the cloud, upload your code and data, and then confirm that the results you are accustomed to getting locally can be reproduced in the cloud. Given the wide variety of available compute instances, it is fairly easy to match a local configuration. After your computation is complete, you can allow your data to remain in cloud storage, push it to backup storage (e.g. Amazon Glacier), or download it if you have a good deal on local storage.
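As a sketch of the keep-it-hot versus archive decision, the arithmetic looks like this. The per-GB prices below are assumptions for illustration only; check current AWS pricing before budgeting:

```python
# Monthly cost of leaving 20 TB of results in S3 vs. archiving to Glacier.
# Per-GB prices are illustrative assumptions, not current quotes.
S3_STANDARD_PER_GB = 0.023   # USD per GB-month (assumed)
GLACIER_PER_GB = 0.004       # USD per GB-month (assumed)

def monthly_cost(terabytes, per_gb_rate):
    """Monthly storage cost in USD for `terabytes` of data."""
    return terabytes * 1024 * per_gb_rate

hot = monthly_cost(20, S3_STANDARD_PER_GB)
cold = monthly_cost(20, GLACIER_PER_GB)
print(f"S3: ${hot:.2f}/month  Glacier: ${cold:.2f}/month")
```

Archival tiers also charge retrieval fees and impose retrieval delays, so factor in how often you expect to touch the data again.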
Consider, though, that simply reproducing workloads in the cloud might underutilize the capabilities of elastic computing. Instead of trying to reproduce a Beowulf-style cluster environment, you could use a single Amazon instance with, for example, 64 cores and 256 GB of RAM; much can be achieved on one instance, which in most cases simplifies deployment since you would not need to run jobs across many nodes unless desired. That said, one could spin up a classic cluster solution within Amazon using software such as CfnCluster or MIT’s StarCluster.
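As a minimal sketch of the single-big-instance approach, Python’s standard library can fan a per-read task across all available cores without any cluster scheduler. The GC-content task and the reads here are toy placeholders:

```python
# Fan a per-read task across the cores of one large instance using only the
# standard library; no cluster scheduler required. Task and data are toys.
from multiprocessing import Pool, cpu_count

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (placeholder per-read task)."""
    return (seq.count("G") + seq.count("C")) / len(seq)

if __name__ == "__main__":
    reads = ["ACGTACGT", "GGGCCC", "ATATATAT", "GCGCGCAT"]  # placeholder reads
    with Pool(processes=min(cpu_count(), len(reads))) as pool:
        results = pool.map(gc_content, reads)
    print(results)
```

On a 64-core instance the same few lines use all 64 cores; the work that used to be spread across cluster nodes simply becomes worker processes on one machine.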
I don’t know much at all about configuring Linux machines or how to architect computational environments. What can I do?
In some cases you might be able to benefit from using SaaS (Software as a Service) solutions that neatly sidestep the requirement to create everything from scratch. Services such as Galaxy CloudMan allow you to create a fully functioning, scalable version of Galaxy, a popular bioinformatics framework for analyzing genomic data. Of course you are still left with the job of using the software, but you did not have to agonize over how many servers to implement, how much storage to set up, what operating system to use, or what versions of various genomes to download. This is a true convenience, especially for someone just breaking into large-scale genomic computation. There are also SaaS solutions such as Seven Bridges Genomics that provide professional support services, so if you anticipate a need for that type of involvement, check them out.
If you don’t want to do any of this, or if you are transitioning to a phase of your career that will involve more in-depth computation, you will need to recruit knowledgeable students. Knowing how to instantiate, manage, and effectively exploit scalable cloud resources is a career path all its own, so many students might be interested in acquiring these skills (or might already have them). However, if you anticipate relying upon students, remember that their “day job” is as a student in a graduate program, so they still need to focus on science and research. You should identify students with some programming experience, ideally with multiple languages (Python, C++, Java, R), hopefully some Linux command line experience, and, if you are lucky, some system administration experience. Spinning up Linux instances is very common, thus being able to install, upgrade, and configure open source packages is essential. Thankfully, many bioinformatics analysis environments are being distributed as AMIs or Docker images for convenient use with little or no up-front configuration.
I have collaborators at other institutions. Is it possible to include them in my research when using cloud resources?
Yes. In fact this is a key feature of Amazon AWS in that it uses Identity and Access Management (IAM) policies to assign roles to team members independently of location. Of course it is also possible to set up computing instances in a VPC (Virtual Private Cloud) that may or may not be reachable from the general Internet, though this is completely under your control. The cloud provides an easy-to-use shared middle ground for collaboration, and ongoing work can be “mothballed” and then reanimated at will without losing configurations and various workflow dependencies.
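As an illustration, an IAM-style policy granting a collaborator read-only access to a shared data bucket might look like the following sketch. The bucket name and ARNs are placeholders, and in practice you would attach the policy via the AWS console or API:

```python
# Sketch of an IAM-style policy document granting read-only access to one
# S3 bucket. Bucket name and ARNs are placeholders, not real resources.
import json

collaborator_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [
                "arn:aws:s3:::my-lab-data",     # placeholder bucket ARN
                "arn:aws:s3:::my-lab-data/*",   # objects within the bucket
            ],
        }
    ],
}
print(json.dumps(collaborator_policy, indent=2))
```

The point is that access is expressed per resource and per action, so a collaborator at another institution can read your shared data without being able to touch your instances or budget.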
What are the basic concepts behind cloud computing?
There is an incredible amount of hype and terminology overlaying the cloud space, much of which has emanated from vendors seeking to reposition themselves and their products to profit from cloud technologies. However, as a biomedical researcher your concerns probably relate more to processing terabytes of sequencing data, building an accurate, robust predictive model, or processing billions of remote sensor readings, all of which require far more than a canned service. Please understand that not all vendors who offer solutions are flawed or are hawking yesterday’s technology under a new name; there are simply enough that do to warrant caution.
Consider that there are 1) computational environments, such as those offered by Amazon and Google, and 2) whatever you wind up putting “on top” of these computational environments. For example, you will hear of SaaS (Software as a Service) solutions that are “one stop shops” provided by a vendor or institution. Think of Microsoft 365 as such a solution, wherein you access email, calendaring, and file sharing from one interface via a web browser. In the world of bioinformatics, think of something like the Galaxy CloudMan package, which offers a comprehensive environment for genomic data processing and transparently provisions the servers, storage, operating systems, and networking required to support such a tool.
However, for the typical computational researcher the point of entry into the cloud will typically be IaaS (Infrastructure as a Service), which allows you to specify the amount of storage, the number and type of servers, the networking between them, and the operating system type. Shared file systems can be set up, as can event-driven processing. IaaS provides the greatest amount of flexibility but also requires the greatest amount of knowledge to provision. Some researchers slide comfortably into this role, though most seek out collaborators and/or recruit students who can help.
What type of improvements can I expect by moving my code to the cloud?
If you are experiencing performance bottlenecks, it is natural to assume that any “larger” resource will probably provide enhanced performance. Frequently that is the case, though it isn’t always clear which key factor(s) contributed most. That is, was it the increased network bandwidth, additional memory, or additional core count that led to better performance? There are ways to make this determination. But estimating a “percent improvement” simply by moving to the cloud (or even to another local computer) is difficult in the absence of an existing benchmark, yet it is one of the most frequently asked questions I get. I understand that many researchers (or their students) are executing pipelines given to them by someone else, so having an intimate knowledge of which components in a workflow might need refactoring might not be a primary consideration, though it really should be, especially when anticipating a need for massive scale-up. Frequently the code and data are moved to an instance and that becomes the benchmark, which is fine. Just keep in mind that arbitrarily selecting configurations with the hope of a generalized performance increase is not a very scientific approach. If you need help, ask for it.
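One low-effort way to get a defensible baseline is to time the same workload function in each environment. A sketch, where the workload is a stand-in for one of your own pipeline steps:

```python
# Tiny benchmark harness: time the same workload in each environment so a
# "percent improvement" is measured rather than guessed. Toy workload below.
import time

def benchmark(fn, *args, repeats=3):
    """Best wall-clock time in seconds over `repeats` runs of fn(*args)."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)
        best = min(best, time.perf_counter() - start)
    return best

def workload(n):
    """Placeholder compute-bound pipeline step."""
    return sum(i * i for i in range(n))

elapsed = benchmark(workload, 200_000)
print(f"best of 3: {elapsed:.4f} s")
```

Run it locally, record the number, then run it unchanged on each candidate instance type; the ratio of the two timings is the “percent improvement” you were trying to guess.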
What do I need to watch out for when experimenting with the cloud?
The unanticipated big bill
The biggest fear anyone has involves receiving a large bill after engaging in some basic experimentation. Some researchers fear that merely signing up for a Google or Amazon cloud account will incur cost. Don’t worry: this isn’t a cable subscription. You pay only for what you use, so if you don’t use anything then no charges will accrue. There is a considerable amount of Amazon training material available on YouTube that helps novices become comfortable with navigating the AWS Dashboard, so even for total newcomers it is easy to set up an account and begin experimentation.
A typical experimentation session on Amazon might involve you logging in, creating an instance, having some fun with it, and then getting interrupted. Let’s say that four hours have passed before you notice that your instance is still running. You are being charged whatever that hourly rate is, although if you booted a smaller instance you would be looking at maybe $2.50 total. So you pay for what you use. The solution, however, is very easy. Amazon allows one to set “alarms” that send email or text messages based on user-specified thresholds, so the very moment your use exceeds a certain dollar amount, you get notified. This saves you money and also provides motivation to watch your workloads. As your computing becomes more complex, you can use load and/or activity triggers to deactivate instances that have been idle for a specified period of time.
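The alarm arithmetic itself is simple. A sketch follows; in practice you would configure a CloudWatch billing alarm rather than poll yourself, and the hourly rate here is an assumption chosen to match the small-instance example above:

```python
# Alarm arithmetic: has the accrued charge crossed an alert threshold?
# The $0.625/hour rate is an assumption matching the example above.
def should_alert(hourly_rate, hours_running, threshold_usd):
    """True once accrued charges meet or exceed the alert threshold."""
    return hourly_rate * hours_running >= threshold_usd

accrued = 0.625 * 4   # ~$2.50 after 4 forgotten hours on a small instance
print(f"accrued: ${accrued:.2f}, alert at $2.50: {should_alert(0.625, 4, 2.50)}")
```

The cloud-hosted version of this check is what fires the email or text message the moment your spend crosses the threshold you chose.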
Amazon has training wheels
Amazon does provide 750 free hours per month for experimentation, though it requires use of “free tier” resources. That is, you can’t just log in and spin up a 128-core, 1.9 TB instance for free. The free tier involves micro instances that I consider to be the “training wheels” of the cloud. These instances are not very useful for doing real genomic computing work, but they are quite useful for learning how to configure virtual machines and becoming familiar with the basic processes of managing instances.
I’m currently negotiating a startup package. Should I request cloud credits instead of actual hardware?
This is an interesting question, as being able to continue your work in a new environment is essential to your success. For whatever reason, some researchers are shy about drilling down deeply into the computational capabilities of a potential employer, when that should probably be one of the first things to discuss. This is particularly important given that institutions are reviewing (or should be) their approach to maintaining and refreshing local hardware in light of the potential cost savings offered by Amazon and Google. The last thing you want is to show up only to find minimal support for your work. All this said, I’m pretty sure you could request cloud credits, but this assumes that you have some familiarity and experience with AWS, to the extent that you could reasonably project workload needs. In the absence of that, you could still request some credits for experimentation while using local resources. Independently of the cloud-versus-local issue, always make sure that good user support for your project will be available. I’ve talked to many researchers in transition who observe wide variation between institutions in terms of computational support. Always ask detailed questions about the computational environment before accepting the job.
I use licensed software like MATLAB. Can I still use the cloud?
This is an important question because many institutions have site licenses for commercial products such as MATLAB that allow them to run jobs locally. The best answer to this type of question is to contact the vendor directly and ask them to describe approved cloud options. Most will direct you to an Amazon Machine Image (AMI) that has the software pre-loaded and licensed such that it is available once you launch that AMI. The vendor will typically charge licensing fees on top of the Amazon usage fees. MathWorks (MATLAB’s parent company) has its own cloud service, in turn based on Amazon AWS, that one can use by signing up on the MathWorks site.
You might also consider refactoring your code to remove the dependency on the commercial solution. Of course, this assumes that you can do so without impacting the results you are accustomed to getting. This isn’t to say that these commercial solutions are somehow bad – far from it. Just that you might want to determine whether the code you are using can be replaced, if only in part, using open source tools. As an example, the Julia language provides a robust, well-performing, parallel language for matrix manipulation (among other things) – all for free. More common open source substitutes include Python and R. Note that I am not suggesting that replacing your commercial code with these alternatives is easy or could be done overnight. Depending on what you are doing in the code, it could become an involved process. However, if you need to scale up your code and you want to disperse or share it with a larger community, then moving to open source will help accomplish that aim.
From a practical point of view, how do I start using Amazon Web Services?
First, sign up for an account. Don’t worry, it doesn’t cost anything, though you will be prompted to enter a credit card number. You might also consider applying for AWS research grant credits that can be used for aggressive experimentation. Consider completing the AWS “getting started” tutorials designed to acquaint you with the basic procedures behind spinning up some example Linux-based instances. For institutions with a lynda.com subscription, there are Amazon orientation videos available as well. Keep in mind that most successful computational researchers will assemble a team over time, across which work can be distributed accordingly, so always think collaboratively, as Amazon has many services in addition to “elastic computing” that will require an investment of time and effort to master. On the other hand, this is true of any field wherein innovative services are on offer – learning to integrate them into your work does take time, so budget accordingly. The idea of user support is always important, so you should ask your local IT or computational resource to what extent they will assist (or not) your use of the cloud. Ideally they would be enthusiastic about helping you, though understand that the “cloud vs local” issue can represent a political hot point for IT organizations concerned about protecting “turf”.