Open Telekom Cloud for Business Customers

Open Telekom Cloud adoption guide

Learn more about suitable use cases and how to transform them

Do you wonder how to make use of T-Systems’ Public Cloud (IaaS - Infrastructure as a Service) offering Open Telekom Cloud (OTC) for your future workloads? And how you could migrate your existing workloads from another infrastructure such as AWS to Open Telekom Cloud?

This blog post is the first of a series and will explain the general basics and advantages of scale-out cloud infrastructures. You will also learn about the key differentiators (such as SLAs) from traditional physical or virtual (“legacy on-premise” or "Enterprise Cloud") infrastructures. The specific challenges of getting virtual machines and services to work well in Open Telekom Cloud are addressed next, along with some considerations on security and deployment automation.

Since this is just the first post of a series, you may ask yourself: what’s next? Just check back on our blog on a weekly basis and you’ll find the continuation:

  • Next time I’ll tell you more about how to make VMs work well in Open Telekom Cloud
  • In part 3 I’ll cover the use of Open Telekom Cloud for test and development environments
  • Part 4 will shed some light on how to implement a web presence on Open Telekom Cloud
  • Last but not least (for a start), part 5 will explain how migrations of applications from other scale-out cloud platforms can be executed successfully.

With this series of posts I explicitly focus on the general technical considerations for making applications work well on Open Telekom Cloud and take advantage of its benefits. Every cloud adoption project will involve several other phases and implications, such as analyzing business requirements, the effort to adapt an application so that it works well on a Public Cloud, organizational changes and of course cost. All these aspects can be individual and complex and thus will not be covered in this blog.

Part 1 – Basics, advantages and differentiators of Open Telekom Cloud

Adoption steps toward Open Telekom Cloud

As you are probably used to a Private (Enterprise) Cloud or “legacy on-premise” IT model, the first step you have to take towards Open Telekom Cloud adoption is a learning session to understand the “on demand and pay-per-use” Public Cloud model. You first need to know the fundamentals of Public Cloud architectures and what they mean, in terms of adjustment, for the applications that shall run on them.

The different use cases for the Open Telekom Cloud provide guidance for the next step: identifying the perfect application candidates for the Open Telekom Cloud. You should carry out an assessment of each application that covers availability requirements, scale-out design, legal/SLA aspects and the transformation skills needed.

The following picture gives you an idea of this assessment process:

Graphic showing the Open Telekom Cloud Assessment process.
 

Once you have identified suitable applications and use cases, each application will potentially need to be adjusted to a lower or higher degree. Certain functions can be replaced by platform services or standard scalable open source building blocks. The platform services of Open Telekom Cloud need to be chosen and customized to fit the application and to deliver the desired availability and resilience.

Once you have adjusted the application and implemented the platform services you require, your next challenge is a thorough testing phase. Testing should be automated as fully as possible (even consider test-driven development). To achieve the required service availability, failures of instances on the platform should be a key part of your test scenarios, as such failures simply belong to the Public Cloud paradigm (Netflix’ “Chaos Monkey” is a great reference here). Thus monitoring the service availability is key; monitoring the health of individual virtual machines is not. So deal with instance failures to improve the resilience of your application.

Besides testing the application, you have to consider and plan the transition of application data to the Open Telekom Cloud platform. Depending on the amount of data, this transition can become a rather big challenge; “Bandwidth as a Service” was introduced with the Open Telekom Cloud to cope with part of it. A shift to a Public Cloud also brings changes to operations processes. It is up to you whether Managed Services or DevOps (you develop and operate yourself) better fits the operations model for the application and the business.

Scale-out cloud applications

The next important aspect you need to grasp is the paradigm change that comes along with the evolution of IT outsourcing from the “legacy on-premise Enterprise” model to the “pay-as-you-go, on demand, scale-out and commodity” nature of Cloud Computing.

In short:

Cloud Computing, with all its flexibility advantages, is based on cheaper scale-out commodity hardware, which implies lower availability at the infrastructure level than legacy IT outsourcing approaches, which usually rely on more expensive scale-up enterprise equipment. Thus service availability becomes a matter of the application instead of the infrastructure layer. You therefore have to adjust your application to scale out instead of up when more compute/network/storage capacity is required, and to survive (more frequent) infrastructure failures.

Some more background:

Public Cloud or, to use a better term, Scale-out Cloud Architecture is based on different paradigms than traditional infrastructure. Smart engineers have spent decades making infrastructure (hardware, hypervisors, operating systems, ...) more reliable. However, there are diminishing returns: squeezing more 9's out of the infrastructure (i.e. increasing availability above 99.999%) incurs higher and higher complexity and sharply increasing cost. On the other hand, the spread of commodity infrastructure has given it a huge economy-of-scale advantage, and scale-out clouds have turned the equation around: what if we could build reliable services on top of only moderately reliable (read: commodity) infrastructure? The answer: the service availability is the only thing that counts.

During the last decade, paradigms that fit the scale-out cloud architecture have emerged and been leveraged to produce very reliable services with very high flexibility at much lower cost than traditional models would have allowed.

Graphic showing scale-out cloud applications.
 

The goal of this chapter is to introduce the most relevant concepts of scale-out cloud application architecture and to sketch the perfect citizen for Open Telekom Cloud. The principles are agnostic to the exact underlying cloud: they apply not only to Open Telekom Cloud, but also to other OpenStack-based clouds and to other scale-out clouds such as AWS.

Architecture and design principles

Of pets and cattle

When you come from the physical world, a server is something expensive that needs to be procured through a lengthy process. Afterwards, the server is carefully configured and maintained, and careful change and update management is applied to keep it productive. If you have lots of servers, you use enterprise management systems to keep an overview of the state and configuration of the machines and to ensure the servers stay healthy. If you have virtualized your servers for better utilization, chances are that you use an enterprise virtualization (or enterprise "cloud") management system and have not fundamentally changed the way you look at servers. Randy Bias called this way of using a server “treating it like a pet.”

What if VMs were abundant and could easily be created and deleted at near-zero overhead? Would you change the way you use them? This is where scale-out ("public") clouds have changed the model. You no longer care for individual VMs. You create them as needed and dispose of them when you no longer need them. This is “the cattle approach” to managing machines: you are no longer interested in the individual animals as long as the herd as a whole moves the way it should.

There are a few angles from which to look at this paradigm to make it clearer: modular scale-out design, state and statelessness, automation of configuration, and failure domains. So let’s look at these more deeply:

Modular scale-out design

Scale-out designs tend to break tasks and jobs into small services (microservices) that can be used via an API (often REST-based) and that can scale. If the service on one VM cannot fulfill the throughput requirements, more VMs can be started to share the load. This is a service-driven architecture: services talk to underlying services to fulfill their jobs and handle failures by reconnecting after a short timeout. Some application scaling logic ensures that the health of the service (and not of the machines!) is monitored and that more VMs are started if needed (auto-scaling).
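To make the reconnect-after-timeout behavior concrete, here is a minimal Python sketch (all names are illustrative and not part of any OTC SDK): a caller retries a service after a short pause, so the loss of one instance does not surface as a failure to the layer above.

```python
import time

def call_with_retry(service, attempts: int = 3, backoff: float = 0.01):
    """Call a service function, reconnecting after a short timeout on
    failure, as services in a scale-out design are expected to do."""
    last_err = None
    for _ in range(attempts):
        try:
            return service()
        except ConnectionError as err:
            last_err = err
            time.sleep(backoff)  # short timeout before reconnecting
    raise last_err

# Simulate a backend whose first VM has just died: the first call
# fails, the retry reaches a healthy replacement instance.
calls = {"n": 0}
def flaky_backend():
    calls["n"] += 1
    if calls["n"] == 1:
        raise ConnectionError("instance gone")
    return "ok"

result = call_with_retry(flaky_backend)
```

The caller above never learns that an instance died; that is exactly the property the scaling logic relies on.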

Statelessness

When VMs come and go, they are not the place to keep state. A VM can die at any time; as Netflix put it in their famous “Chaos Monkey” post, instance failure is frequent. The resilience of a cloud application comes from the fact that it survives the death of a VM without losing any data: if the VM has no persistent state, none can be lost.

On Amazon Web Services, the root disks of many instance types (flavors) are ephemeral: their content does not survive the instance. This strongly encourages a model where no one even thinks of using the root file system of a VM as a persistence layer for anything. On Open Telekom Cloud, the root disks are persistent by default to make the transition from the classical world easier, but it is a useful mindset to consider them non-persistent. If a VM needs to persist data, it must be written to some place that is not exclusively owned by the VM: a database, an object in the object store or (if it cannot be avoided) a data disk attached to the VM for this purpose. The big advantage of a stateless VM is that it cannot lose persistent data when it is killed, crashes or is cold-migrated. The configuration of a stateless VM needs to be stored in the image or, better, injected from outside on boot. This is the topic of the next section.
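As a minimal illustration of statelessness, the following Python sketch keeps all session state in an external store. The in-memory dict merely stands in for a real database, key-value store or the object store, and all names are made up for the example.

```python
import uuid

# Stand-in for an external store (in production: a database, a
# key-value store, or the object store). Never the VM's own disk.
SESSION_STORE: dict[str, dict] = {}

def handle_login(user: str) -> str:
    """A stateless request handler: all session state goes to the
    external store, nothing stays on the VM that served the request."""
    session_id = str(uuid.uuid4())
    SESSION_STORE[session_id] = {"user": user}
    return session_id

def handle_request(session_id: str) -> str:
    # Any VM in the group can serve this request, because the
    # state lives outside the VM.
    return SESSION_STORE[session_id]["user"]

sid = handle_login("alice")
```

Because every VM reads and writes the same external store, killing the VM that handled the login loses nothing.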

Automation with cloud-init and/or configuration management tools

Rather than creating an image with all needed configuration pre-injected (thereby creating a need for image proliferation), a best practice in scale-out clouds is to use a small set of generic images and inject configuration at boot.

In the Linux world, cloud-init has become the standard tool for this configuration and customization at boot. cloud-init reads a YAML configuration file and obtains additional configuration from outside data sources; in OpenStack environments this is the metadata service. cloud-init consists of dozens of modules that can do things like setting hostnames, installing package updates or additional packages, resizing the root partition and root file system (to fill the disk), setting passwords, injecting authorized ssh keys, or hooking into a full-blown configuration management system such as chef, puppet, saltstack or ansible. It can of course also do simple things such as injecting files or running scripts. This way, on boot (every boot if you have ephemeral root disks, only the first boot for persistent root disks), the generic image is configured to become a specialized VM that fulfills the task it was started for.
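As a small example, the following Python snippet composes a minimal #cloud-config document and base64-encodes it, which is the form in which user data is typically passed to the OpenStack compute API. The hostname, package list and ssh key are placeholders, not recommendations.

```python
import base64
import textwrap

def build_user_data(hostname: str, ssh_key: str) -> str:
    """Compose a minimal #cloud-config document and base64-encode it,
    ready to be handed to the cloud API as user data at server creation."""
    cloud_config = textwrap.dedent(f"""\
        #cloud-config
        hostname: {hostname}
        package_update: true
        packages:
          - nginx
        ssh_authorized_keys:
          - {ssh_key}
        runcmd:
          - systemctl enable --now nginx
        """)
    return base64.b64encode(cloud_config.encode("utf-8")).decode("ascii")

encoded = build_user_data("web-01", "ssh-ed25519 AAAA... user@example")
```

On first boot, cloud-init inside the VM picks this document up from the metadata service and turns the generic image into a configured web server.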

Using commodity building blocks and/or platform services

Restructuring services so that they scale and tolerate failures can be significant work. Fortunately, the success of the open source movement predates the emergence of scale-out cloud offerings, and there are numerous good open source solutions available that can be used as building blocks. Just as the cloud uses commodity infrastructure, there are commodity software building blocks: key-value stores, NoSQL databases, configuration management systems, code management and CI/CD systems, messaging platforms, and more.

Having identified a service that is part of an application design, the following steps are recommended:

  • Is there a service that the platform offers (such as object storage, RDS, ...) that fulfills the need?
  • Is its SLA good enough and the lock-in effect justified? If so, use it.
  • Is there a company that had to create software for a similar need and has published it, perhaps even as open source? Or at least something we can learn from?

Automation - Using the API

Fast scaling and fast recovery don't work if the system needs manual attention to react. Even in the classic world, a good operator never does the same thing manually twice. In the cloud, the operator uses tools to automate deployments, recovery and scaling. A few tools are used very often for this purpose. Scale-out cloud environments have an API, which you can use to automatically create virtual networks, virtual storage and virtual machines. The APIs are typically simple REST interfaces: with simple tools, you can compose the requests and parse the responses from a cloud service. Chances are, though, that you will prefer higher-level abstraction tools such as “Terraform” (which abstracts AWS' CloudFormation and OpenStack's HEAT).
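To show how simple these REST payloads are, here is a sketch that only composes the JSON body of an OpenStack Nova "create server" request (POST /v2.1/servers) without sending it. Actually submitting it would additionally require a Keystone authentication token, and all the UUIDs here are placeholders.

```python
import json

def server_create_request(name: str, image_ref: str,
                          flavor_ref: str, network_id: str) -> str:
    """Compose the JSON body of an OpenStack Nova 'create server'
    call. Sending it also needs an X-Auth-Token header from Keystone."""
    body = {
        "server": {
            "name": name,
            "imageRef": image_ref,
            "flavorRef": flavor_ref,
            "networks": [{"uuid": network_id}],
        }
    }
    return json.dumps(body)

payload = server_create_request("web-01", "image-uuid",
                                "flavor-uuid", "net-uuid")
```

Tools like Terraform or HEAT ultimately generate requests of exactly this shape; the abstraction buys you dependency handling and repeatability.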

In addition to the API tools, you will likely use Configuration Management systems such as chef, ansible, puppet or saltstack to automate the deployments and maintenance of your environment.

Understanding AZs (Availability Zones) and the SLA

Some events affect not just a single host or rack: with bad luck, a complete datacenter may lose its connectivity or power, or go down in a fire. While a cloud-native application is in general well prepared to deal with failure, this is a particular challenge: if all machines providing a specific service go down at once, there will be a service interruption -- hopefully a short one, until new VMs have come up and can take over. This can be avoided by using the concept of Availability Zones. The cloud setup is distributed over several datacenters that are independently powered and connected to the network, so a catastrophic event is unlikely to affect more than one at the same time. Applications have visibility of the availability zones. Truly resilient applications replicate their persistent data across several AZs and thus survive the death of one AZ without service interruption.

It is important to understand the availability (SLA) definition of a scale-out cloud: being based on commodity components, the hardware provides only moderate guarantees. While the cloud control plane is secured by redundancy, this is not the case for compute hosts; if one goes down, all its VMs go down as well. A dying VM is a normal event on a scale-out cloud -- if the application cannot deal with it, it will not achieve more than 99% availability. A scale-out application will just restart the VM somewhere else; as it is stateless, nothing is lost.

The SLA (currently 99.95% on Open Telekom Cloud) guarantees that no more than one Availability Zone becomes dysfunctional at a time. An application that wants to provide its service at 99.95% availability therefore needs to replicate the data it needs to work across the two zones.
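A quick back-of-the-envelope calculation shows why replicating across zones pays off. The 99.5% per-zone figure below is purely illustrative and not an Open Telekom Cloud specification:

```python
def combined_availability(per_zone: float, zones: int) -> float:
    """Probability that at least one of `zones` independent replicas
    is up, given each zone's individual availability."""
    return 1.0 - (1.0 - per_zone) ** zones

# Illustrative: a single zone at 99.5% vs. two independent zones.
single = combined_availability(0.995, 1)  # 0.995
dual = combined_availability(0.995, 2)    # 0.999975
```

Two only moderately reliable zones together already exceed the 99.95% target, which is the whole economic point of building reliability at the application layer.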

Monitoring

The focus of monitoring should not be the VMs. As already discussed, a VM not being well is a frequent event and no reason for alarm. The focus for cloud applications is on the services, so these are what need to be monitored. Good cloud applications have instrumentation (availability and response time / load monitoring) built into every layer and thus drive scaling as well as health reporting from within the application. Ultimately, the end-user service is what counts; for monitoring this, a probe running outside of the cloud is the ultimate measurement. Such monitoring services are available on the net.

Automatic scaling out

Having the application instrumented to monitor the health of the services that compose it allows automatic action to be taken. When a service is unable to handle the load, new virtual machines can be started to provide additional capacity (scale-out). Likewise, if the load remains low for a certain amount of time, the additional VMs can be removed again (scale-in). This mechanism also helps to recover from failed VMs: the same scale-out mechanism will replace them.

One scenario happens very often in scale-out applications: a load balancer distributes requests from another layer to a service. For this, the Open Telekom Cloud platform offers a mechanism where you create a custom image for the VMs providing the service. The VMs are grouped in an auto-scaling group and monitored with respect to response times (HTTP or TCP) or VM metrics (CPU load etc.). Policies can be defined that trigger automatic reactions when certain thresholds are exceeded for a specified amount of time. If new VMs are provisioned (scale-out), they are automatically added to the related load balancer; likewise for the scale-in operation. These auto-scaling groups provide a very simple and convenient way to do scaling, because most web applications do not need to be changed in any significant way to support this model of operation.
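The threshold-plus-duration logic of such a policy can be sketched in a few lines of Python. Real auto-scaling additionally applies cooldown periods and minimum/maximum group sizes, and all numbers here are made up:

```python
from collections import deque

class ScalePolicy:
    """Toy sketch of a scaling policy: scale out when a metric stays
    above `high` for `sustain` consecutive samples, scale in when it
    stays below `low` that long. Cooldowns and group-size limits,
    which a real platform adds, are omitted for brevity."""
    def __init__(self, high: float = 80.0, low: float = 20.0,
                 sustain: int = 3):
        self.high, self.low, self.sustain = high, low, sustain
        self.samples = deque(maxlen=sustain)

    def decide(self, metric: float) -> str:
        self.samples.append(metric)
        if len(self.samples) < self.sustain:
            return "hold"
        if all(s > self.high for s in self.samples):
            return "scale-out"
        if all(s < self.low for s in self.samples):
            return "scale-in"
        return "hold"

policy = ScalePolicy()
decisions = [policy.decide(m) for m in [85, 90, 95, 50, 10, 5, 5]]
```

Requiring the threshold to hold for several consecutive samples is what prevents the group from flapping on short load spikes.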

Testing

In addition to the application-specific testing, two additional capabilities need to be validated:

  • The automatic scaling
  • The recovery from failed VMs

A nice tool to test the latter is the Chaos Monkey, a famous beast invented by Adrian Cockcroft of Netflix. Netflix realized that the only way to deal with the frequent failures of VMs in a public cloud is to consider VM failure a normal event. To test for this, the Chaos Monkey was written and released as open source. It randomly kills VMs of an auto-scaling group. Netflix meanwhile runs it during business hours (!) on their production system (!!) to validate that the resilience of their application is at the expected level.
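The principle is easy to demonstrate: kill a random member of a group and let the scaling loop bring the group back to its desired size. This toy sketch is of course not the real Chaos Monkey, and all names are invented:

```python
import random

def chaos_step(group: list[str], rng: random.Random) -> str:
    """Kill one randomly chosen instance from the group, mimicking
    what Chaos Monkey does to an auto-scaling group."""
    victim = rng.choice(group)
    group.remove(victim)
    return victim

def reconcile(group: list[str], desired: int) -> None:
    """The auto-scaling loop: start replacement instances until the
    group is back at its desired size."""
    n = 0
    while len(group) < desired:
        group.append(f"replacement-{n}")
        n += 1

vms = ["vm-a", "vm-b", "vm-c"]
killed = chaos_step(vms, random.Random(42))
reconcile(vms, desired=3)
```

If the service stays available while this runs continuously, the resilience test passes; if not, you have found a stateful dependency to fix.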

Change management, CI/CD and DevOps

Most cloud applications are developed in an agile way. The code is constantly built (Continuous Integration, CI) and very frequently deployed into test environments (Continuous Deployment, CD). Deployments into the production environment also happen very often; weekly deployments are not untypical. The deployment process itself is part of the engineering: the Development and Operations teams are both involved in it and have a joint responsibility to ensure that deployment and operations work. Overcoming the traditional development vs. operations silos has proven very beneficial; this close collaboration model is referred to as DevOps. Heavyweight change management processes that try to avoid change wherever possible do not fit very well into this world. Instead, the testing and validation are automated, as are the deployment and change processes. Should serious errors slip through nevertheless, a quick rollback can be performed.

What’s Next?

As mentioned at the beginning of this post, next time I’m going to share with you how to make VMs work well in Open Telekom Cloud. You will learn about the structural elements of the Open Telekom Cloud: what they are, what they are used for, and how they fit together. Furthermore, I’ll tell you about the OS image options you have: either the public images that are already available on the cloud, or your own images, which can be uploaded to the cloud.

Resources

OTC Documentation Center: https://docs.otc.t-systems.com/
More OTC blog posts: Blog


Photo of Markus Meier

Markus Meier has been involved with IT outsourcing providers for over 15 years. During that time, he held various positions in operations and engineering. For the last five years, he has led a worldwide Linux engineering team at T-Systems. His focus is on the effects and challenges facing the provider during the paradigm shift from classic IT outsourcing to cloud computing. He is actively involved in the evolution of this technology. On top of this, he is currently studying IT Management at the Steinbeis University in Berlin.

 
 


 