Healthy IT Organizations for Dummies

July 15th, 2009

I’ve been working in different IT organizations for close to 10 years, and I have never had a user come up and praise me for how well an Exchange server has been running or how reliable the network has been for the last six months. The truth is that work in IT is mostly thankless, and the only time you hear from users is when something is broken. The usual sign of a well-run IT organization is the absence of both complaints and praise, when all is quiet on the western front. The good news is that IT guys are not dying for your acclaim; they just want things to run smoothly so they can play DOOM3 while everyone else is working. In this blog I will talk about some of the ingredients that make up a healthy IT organization.

Mission
To be effective, every IT organization must have a mission and a clear view of its roles and its reason for existing. In order to create a meaningful mission statement, we need to consider the roles that we must play.

1. Give the users what they need
Support users by providing all the tools, resources, systems, and IT support they need to get their job done in the most efficient way. Provide the technology that helps developers develop, salespeople sell, and executives run the business, so that they don’t waste their time setting up Outlook or configuring Apache.

2. Build a solid foundation
Build and maintain a reliable infrastructure consisting of network, server, telecom, email, intranet, and any other service that is critical for the business to function. This foundation has to be maintained at all times and without it, the ability of the business to be successful is diminished.

3. Look ahead
An IT organization has to be in tune with the general direction, goals, and growth trends of the business. This is necessary to anticipate the growth rate, special requirements, and scale appropriately. We need to be ahead of the curve and be prepared to cater to the business’s needs.

4. Why are we here?
Nurturing an attitude of service in an IT organization is the key to happy users. The sole purpose of any IT organization is to serve its users, period. The users can either be internal to the company or outside customers, but in either case IT would not exist without them. Helping all the members of the IT organization internalize this purpose will increase the level of service, commitment, and quality of work. It can also ease the traditional rivalries between IT, developers, and other teams. The best in people comes out when they serve each other, and this is true in business as well as in our daily life. Every IT team member needs to participate in developing a mission statement for the IT organization.

The process of developing a mission statement together creates buy-in from all members and a higher level of follow-through and execution of the mission. Having a clear purpose and mission enables powerful forces such as accountability, proactive action, self-motivation, and ownership that make the team successful.

Communication
In an IT organization of any significant size, effective communication is key to efficiency, fast response to issues, and fewer human errors. Here are some tools that can help attain good communication.

1. Internal Team Communication

  • It’s beneficial to have regular team meetings to go over outstanding issues, announcements, upcoming projects, or team conflicts. Presenting an open forum where all team members can share openly can go a long way in reducing conflicts and increasing morale.
  • Tools such as a dedicated conference line, an internal e-mail alias, or an Intranet site dedicated to the IT team can ease communication and keep everyone on the same page in emergencies as well as during daily work.
  • Keeping an open-door policy with upper management can be very beneficial in exposing internal team problems and helping address them. Regular one-on-one meetings between team members and management can help build relationships, address problems, and mentor and motivate team members.

2. External communication with users and clients

  • It’s important to tell end users about any upcoming maintenance or changes to the environment that could affect their ability to work. Typical events that need to be communicated are forced password changes, system maintenance, and changes to the way users access the resources and tools they need to do their job. All these events need to be communicated well in advance and again shortly before they take place. This will reduce the number of surprised users and false alarms, and enable users to plan ahead and get their job done.
  • Tools such as an Intranet site, email aliases, or meetings can all be used to communicate effectively with users. One needs to make a judgment call about which communication medium to use when notifying users. It’s a good idea to reach out to users only when absolutely necessary and avoid over-communicating things that are not important. Most users don’t need to know that the RAID array on the mail server needs to be rebuilt; they just need to know that Exchange will be down for 4 hours. Over-communication can create an impression that things are always broken and desensitize users to important notifications.

Effective communication helps teams work interdependently and attain higher levels of productivity.

Team Dynamics
How team members interact and work together determines their productivity and ability to be effective as a whole. There are a lot of factors that can influence dynamics of a team in a positive and negative way. I will focus on positive factors that have been most influential in my experience.

1. Involve team members in decision making
All team members should participate in decision making and be able to share their experience and the best solution as they see it. It’s important to make everyone a part of this process, as it produces a feeling of ownership, unleashing creative energy and commitment to get the job done.

2. Create clear accountability and avoid micromanagement
It’s necessary to clearly define what every team member is responsible for. When there is a project or a task that needs to be completed, the manager needs to communicate clearly and define the following items to a team member:

  • The expected end result
  • The time frame or deadline
  • The resources that are available to assist
  • How this task or project aligns with the team or company goals
  • What dependencies there are to completing this task or project
  • The reward for achieving success and what success means in this case
  • The price of failure and consequences

It’s advantageous to define the above items clearly, because they set expectations and keep the person accountable for their task or project. Consequently, setting clear expectations eliminates the need for micromanagement.

3. Respect everyone on the team
All members of the team deserve to be treated with respect and as professionals regardless of their position or performance. Scorning members in a team setting is almost always a bad idea and will only create a disgruntled employee who doesn’t feel good about themselves and has zero motivation. Scorning should not be mistaken for keeping a team member accountable in front of others. Real teamwork is when one person’s strength helps compensate for another person’s weakness.

4. Encourage career development
Understanding every team member’s long term career goals is the responsibility of every manager. A good start may be to ask, “Where do you want to be in 2 years?” In some situations a team member may not have a clear career path, and you may be able to make some suggestions about what path they should take depending on their strengths and talents. In either case it’s a great idea to give employees some tasks and projects outside their current job description to help them find themselves and develop new skills. Reading literature and writing articles is more of a personal endeavor, but it can also be encouraged.

Many of us spend more time at work than we do with our families, and it makes sense to develop a healthy working environment to better our quality of life. Teamwork offers countless opportunities for growth and character development.

Documentation and Procedures
Whenever an IT organization starts to grow, the need for uniformity and well defined operational guidelines becomes important. It’s hard to scale an IT organization when you are disorganized and don’t have good documentation and procedures; without them the organization becomes unmanageable. Here are some items that can create clarity and focus through good organization.

1. If it needs to be done more than once, document it
Documentation is an investment of time that pays for itself many times over. Not many of us like to document, but we all like to find solutions to our problems on Google or on the Intranet, saving us hours of research and analysis. The rule of thumb is that if you just spent 4 hours solving a problem, you should probably document it and save someone else 4 hours, thereby making the company more efficient. Writing documentation also helps imprint information into your memory and develops writing skills. Another variation of documentation is a knowledgebase for users. A knowledgebase contains solutions to common problems that users can resolve themselves with basic directions. Such a tool can prevent new tickets from being created and give the IT team more time to work on other issues.

2. Someone must have done this before
Procedures are necessary to create order and consistency in an IT organization and should be developed as an organization matures. Some good items to write procedures for are system naming conventions, server builds, IP address assignment, escalations, new hire steps, termination steps, and others. Any task that’s done regularly by different team members is a good candidate to write procedures for.

It’s important for management to emphasize and enforce the creation of the documentation and guidelines as well as make sure they are followed.  This investment will pay off in a big way down the road.

Innovation and Cost Effectiveness
There are many ways to skin a cat as well as many different tools to help users get their job done. It’s up to the IT organization to research and implement solutions that are most cost effective and practical. There are real cost savings that can be actualized by utilizing the myriad of open-source tools available today, but one has to use judgment when committing to any solution by weighing the pros and cons. I will describe some of the criteria that can be used to find the best tool for the job as well as slim down IT costs.

1. What do they really need?
In order to find the right tool for the job, it’s necessary to identify all the functions needed to meet the user’s or organization’s needs. The key is to differentiate needs from wants and focus on functionality that is critical. For example, when selecting an email system one will need to identify the must-have functions such as web interface, calendar, meeting scheduler, large mailbox capacity, mobile device compatibility, and so on. It’s also wise to account for growth and foreseeable changes in the company, making sure the tools will be able to scale and meet future requirements. Once you have a list of the must-haves, you may move on to the research phase.

2. Who writes this stuff and which one do I pick?
Once you start your research process you will quickly learn that there are many tools available that can meet your needs. You will run across commercial and open-source tools, and may even consider outsourcing or using one of the SAAS products available. You can use the questions below to narrow down your options to two or three solutions that may work for you.

  • Does it have all the functions you need?
  • How well is it supported?
  • Is there an active developer and user community?
  • What are the existing users saying about it and are they happy?
  • Will it scale?
  • What are the license costs if any?
  • How much will it cost to support the tool?
  • Are you able to host it yourself?
  • Can it be outsourced or can you use a SAAS product?
  • Do you have the talent to support it internally?
  • How easy is it to use and maintain?

Your research should yield 2 or 3 prime candidates. Ideally you will end up with one open-source, one commercial, and one SAAS option for you to pick from. You can now weigh them against each other and pick the winner.
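One common way to weigh the finalists against each other is a weighted scoring matrix. The sketch below is only an illustration of that idea: the criteria names, the weights, and the 1 to 5 scores are all hypothetical placeholders, not figures from any real evaluation, and you would substitute your own must-have list.

```python
# Hypothetical criteria weights (they sum to 1.0); adjust to your own priorities.
WEIGHTS = {"features": 0.4, "support": 0.2, "cost": 0.25, "ease_of_use": 0.15}


def weighted_score(scores, weights=WEIGHTS):
    """Return the weighted sum of a candidate's per-criterion scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise ValueError(f"missing scores for: {sorted(missing)}")
    return sum(weights[name] * scores[name] for name in weights)


def rank_candidates(candidates, weights=WEIGHTS):
    """Return candidate names ordered from highest to lowest weighted score."""
    return sorted(
        candidates,
        key=lambda name: weighted_score(candidates[name], weights),
        reverse=True,
    )


# Made-up 1-5 scores for three imaginary finalists (higher is better,
# so a high "cost" score means the option is cheap).
candidates = {
    "open-source": {"features": 4, "support": 2, "cost": 5, "ease_of_use": 3},
    "commercial": {"features": 5, "support": 5, "cost": 2, "ease_of_use": 4},
    "saas": {"features": 3, "support": 4, "cost": 3, "ease_of_use": 5},
}

if __name__ == "__main__":
    for name in rank_candidates(candidates):
        print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

The numbers matter less than the exercise: agreeing on the weights forces the team to say out loud which criteria are needs and which are wants.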

3. Commercial, open-source or outsource
It’s necessary to determine whether commercial, open-source or SAAS is the best option for your organization. All the pros and cons need to be considered alongside your organization’s capabilities, both technical and financial. Here are some things to consider about each of these options.

Commercial

Pros:
Commercial tools usually come with support, which takes some load off the IT team and so requires fewer personnel. It’s likely that commercial tools have been thoroughly tested, have fewer bugs, and are more stable. When purchasing tools from the major vendors you can expect them to follow industry standards and be compatible with other systems and tools.

Cons:
Licensing costs can be a major deterrent when looking at commercial software as an option. Large software vendors can be slow to respond to feature requests and you may have to wait longer between new versions of software. The license will likely not include access to the source code, so you may not be able to customize the tool to your liking.

Commercial is the way to go when you can afford the extra licensing costs and don’t want to staff specialists to maintain the tools. A careful cost-benefit analysis should reveal if commercial is a good option for you.

Open-Source

Pros:
There are usually no licensing costs associated with open-source products, and this can be a big attraction when considering this option. Because the code is readily available, the tool can be customized to meet your needs. When there is an active developer and user community, it will be relatively easy to request new features and get support when you run into problems. New and improved versions of the product are released often and are easily available for download. Open-source tools are often on the cutting edge of technology.

Cons:
It may require significant time and expertise to set up and maintain open-source tools when documentation and support are lacking. You will likely have to retain a specialist to implement and support this option. This option carries a higher risk of bugs and stability issues that may take the community significant time to fix. Another risk of going with open-source is that the development of new versions and updates may stall and be abandoned by the community. In this situation you may have to move to another solution, and this can be costly and inconvenient to say the least.

Open-source is a great option when you have the manpower to implement it and maintain it. This option can keep your IT organization running lean. Do your due diligence on the research and planning to benefit from this option.

Outsourced and SAAS

Pros:
You pay for what you use and don’t have to retain a specialist to set up and maintain the software. The tools are always available and are completely supported by the vendor. There is often no capital investment and you can pull out anytime you want.

Cons:
Outsourcing and SAAS can become expensive depending on the size of your organization and/or how much you utilize the software. SAAS solutions can be less flexible and are likely to have a set of features that are not customizable. You are dependent on the vendor for hosting and maintaining the tools and this carries obvious risks.

Outsourcing and using SAAS can be valuable depending on the organization and its unique resources and requirements. For example, this option can be a good choice for a small organization that can’t afford a large investment in people or software and benefits from paying only for what it uses. Once again, do your due diligence and research this option to make sure it’s a good fit.

Security and Business Continuity
An IT organization should be prepared to handle, and where possible prevent, security incidents, and to respond to disasters that interrupt the business’s critical services. Although it’s impossible to anticipate and be prepared for all types of harmful events, it’s a good idea to cover as many bases as possible. The cost of not being prepared is simply too high: although these events are rare, it only takes one to wipe out a business or cause major financial damage. There are different strategies and solutions to address these issues, and some of them carry a hefty price tag. I will talk about some affordable solutions that there is no excuse to ignore.

1. The enemy on the inside
It’s important to keep sensitive internal information away from prying eyes and from being leaked outside the company. The key is to have defined guidelines on what resources different users have access to and to have the tools to monitor and enforce them.

  • Have a central authentication mechanism like LDAP or Active Directory that enables granular access to resources. This gives advantages of easy administration, auditing, and ability to disable access quickly to all resources.
  • Keep auditing in mind by retaining access logs, access records and enabling auditing capability in your software and devices. Good auditing capability makes investigation and prevention of security incidents possible.
  • Monitor unauthorized access and suspicious activity on your systems by using various commercial and open-source tools. Automatic alerts of security events can provide early detection and prevention.
  • Implement a solid antivirus and malware detection and removal system on all systems.
  • Enforce regular password changes, discourage password sharing, and educate users to be security minded.
  • Learn from your mistakes by taking preventive action to make sure similar security events don’t recur.

2. Are the bad guys out to get us?
Being paranoid about security can take away focus from other important IT tasks, but ignoring it altogether can be naive. There is a good chance that your network has recently been scanned for vulnerabilities by hackers or one of your systems may have been compromised while you know nothing about it. It’s unlikely that your business will be singled out and personally targeted by hackers, but it’s likely that one of the thousands of running hacking scripts will gain sensitive company information or access to your systems. This information may then be given or sold to someone that will send out spam or compromise other systems using your hardware. A successful security breach can disable your mission critical systems, cause leaks of sensitive data, and put your business in a position of being liable for damages caused to your customers. I will talk about some good practices that can help protect your business from external security threats.

  • Design and maintain a secure network as it is your first line of defense and you must make it as impenetrable as possible.
  • Limit access using a firewall and allow access to only necessary internal services. You can also make sure the systems that are accessible from the outside are in an isolated DMZ.
  • Avoid using insecure protocols like FTP, HTTP, POP3 and others to transmit passwords.
  • Have a change control system to monitor all firewall changes.
  • Enforce strong passwords and regular password changes.
  • Implement one of the open-source or commercial intrusion detection systems.
  • Isolate wireless networks from the rest and utilize the highest level of encryption.
  • Implement a solid antivirus and malware detection and removal system on all systems.
  • Perform regular security audits on your own environment to find vulnerabilities or hire an outside party to do the audit.

3. An earthquake or user error
Your business relies on your mission critical systems to operate at all times. Natural disasters, power outages, and user error can cause data loss and system downtime. Although natural disasters are rare, user error and hardware failures happen all the time. I will talk about some solutions that can prepare you for these types of events and help you overcome them.

  • Have a backup strategy which includes a good backup system, regular test restores, and sending tapes offsite.
  • Keep detailed diagrams and technical information about your environment in case it needs to be rebuilt.
  • Invest in your mission critical systems and make them highly available using one of the many open-source and commercial solutions available.
  • Have protocols on how to respond to disasters, data loss caused by hardware failures, or user error. It makes sense to review these regularly and even conduct training exercises.
  • In a situation where you have more than one office, it’s an option to utilize global load balancing and different types of replication to keep an exact replica of your mission critical systems at two or more sites. This can be a costly solution and doesn’t make financial sense for most businesses.
  • Make sure to have a good UPS (uninterruptible power supply) system in your datacenter or even a power generator if possible.
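A test restore is only useful if someone checks that the restored data actually matches the original. As a minimal sketch of that check, assuming the restore has been written to a scratch directory next to the live data, the following compares SHA-256 checksums file by file. The function names and directory layout are illustrative only; a real backup product will usually have its own verification feature.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not have to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_restore(original_dir, restored_dir):
    """Return relative paths that are missing or differ in the restored copy."""
    original_dir, restored_dir = Path(original_dir), Path(restored_dir)
    problems = []
    for source in sorted(original_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(original_dir)
        restored = restored_dir / relative
        if not restored.is_file() or sha256_of(source) != sha256_of(restored):
            problems.append(str(relative))
    return problems
```

An empty list means every file in the original tree was found with identical contents in the restored tree. Files that changed after the backup was taken will show up as differences, so the check works best against data that is known to be static.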

The Rest
I have touched on some of the important ingredients that make a healthy IT organization and likely missed some as well. The purpose of this blog was not to give a step by step recipe on how to run an IT organization, but instead to be used as a guide to find your own way and make fewer mistakes. You can’t build a great IT organization overnight, but it’s about progress not perfection.

ppragin Management

Scaling MySQL in the Web Environment

April 8th, 2009

What is “sharding”?

Today’s websites like MySpace and Facebook serve as many as 150k requests per second and thus require a well engineered architecture that is scalable and highly available. This is often achieved using a combination of custom designed applications, modified kernels, proxies, specialized network hardware, deployment systems, and creative database configurations. Although all of these components can be challenging to design and implement, the Achilles heel of any large web deployment is scaling the database layer.  Scaling MySQL can become very challenging especially with write-intensive applications. Various MySQL replication schemes facilitate scaling database reads quite well.  However, scaling writes leaves much to be desired since MySQL replication doesn’t allow writing to multiple database master servers. The MySQL NDB Cluster enables having numerous master servers that can be written to. Its performance, though, is subpar and this product is simply not ready for prime time in a demanding web environment. This leaves architects looking for other solutions to satisfy the write hungry web apps of today.

A more promising alternative employs MySQL “sharding”, a way to distribute MySQL data across numerous redundant database nodes. Each node may consist of many database servers using replication and may host multiple shards. The web application can access the needed data for read and write queries using a database abstraction layer such as Hive-DB, Hibernate Shards or a custom solution. The abstraction layer sits between the application and database servers and is designed to make the numerous nodes and shards transparent to the application. It also determines where to allocate new data as well as where to retrieve existing data. The “sharding” model lets the database layer keep scaling by adding nodes and is used by many high traffic web sites. Below, I describe a highly available and scalable architecture employing this model that can be used in a production web environment.

Scalable and Highly Available Web Cluster using “sharding”:

[Figure: diagram of the scalable and highly available web cluster using “sharding”]

Application and Database layers:

The user’s HTTP traffic comes in through the “web” load balancers and gets distributed to the application servers. The application servers are running Apache fronted with the Varnish cache proxy. Varnish offloads traffic from Apache and serves cached content much faster than Apache would. By using the proxy, we significantly increase the number of requests per second that the application servers can handle and decrease page load times. By monitoring server and network usage with tools like Munin, we can determine when we are getting close to capacity on the application layer and add application servers whenever necessary. This architecture gives us the ability to easily scale the application layer and provides fault tolerance if any of the application servers fail.

A typical social networking site has millions of users whose information is kept in a MySQL database along with their posts and comments. This large user base and unpredictable growth potential requires scalability on the database layer. The need for growth on the database layer is fulfilled by using many database nodes where the different users, posts, and comments are distributed using “sharding”.

Every node consists of an “active” and a “passive” database master server in a failover configuration using DRBD and Heartbeat. Each node also includes numerous slave servers that use replication to stay in sync with the master server. DRBD and Heartbeat create an active/passive configuration in which a failed active master server is automatically replaced by the passive master server.

All the database nodes are positioned behind the “database” load balancers and each node has a read VIP and a write VIP. The write VIP is used to send queries to the active master server and the read VIP is used to load-balance read queries across the numerous slave servers. This architecture lets us scale database reads easily by adding slave servers to an existing node and scale database writes by creating new nodes. Fault tolerance is achieved using DRBD/Heartbeat and load balancing to eliminate single points of failure.
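The read VIP and write VIP split can be pictured as a small routing decision made for every query. The following is a rough sketch of that decision, not code from any of the tools mentioned here: the `Node` class, the `vip_for` function, the list of write verbs, and the IP addresses are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    write_vip: str  # address that reaches the active master
    read_vip: str   # address load-balanced across the slave servers


# Statements that change data and therefore must go to the master.
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}


def vip_for(node, sql):
    """Pick the write VIP for data-changing statements, the read VIP otherwise."""
    words = sql.split()
    if not words:
        raise ValueError("empty SQL statement")
    verb = words[0].upper()
    return node.write_vip if verb in WRITE_VERBS else node.read_vip


# Hypothetical node with placeholder addresses.
node = Node(name="node1", write_vip="10.0.1.10", read_vip="10.0.1.20")

if __name__ == "__main__":
    print(vip_for(node, "SELECT name FROM users WHERE id = 42"))
    print(vip_for(node, "UPDATE users SET name = 'x' WHERE id = 42"))
```

Because replication to the slaves is asynchronous, a read that must see a write made a moment ago may need to be sent to the write VIP as well; a production router would need a way to express that.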

“Sharding” and the application:

Now that we have the application servers and numerous database nodes in place, we need to find a way for them to work together. The most challenging part of the overall design is facilitating data access, such as allocating new user data and finding existing user data, which is complicated by the addition of numerous database nodes. We use a database abstraction layer to make these nodes transparent to the application, as if there were only one database server. Hive-DB and Hibernate Shards are open source options providing database abstraction layer functionality. One of these or a custom solution enables the application to issue queries while the abstraction layer handles locating the data on the numerous nodes. All the nodes have the same database schema, but different data is distributed throughout the nodes. Data belonging to different users may be found on different nodes. The abstraction layer is aware of the read and write VIPs for all the nodes and indexes certain values like user-ids in a directory database so that it can use them to locate the records on the different nodes. When “sharding” is implemented correctly it can make scaling on the database layer seamless and allow for moving shards between nodes as well as rebalancing them.
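To make the directory idea more concrete, here is a toy sketch of a user-id directory. It is not how Hive-DB or Hibernate Shards work internally; the `ShardDirectory` class and its methods are hypothetical, the directory is held in memory rather than in a database, and the "least-loaded node" placement rule is just one simple policy chosen for the example.

```python
class ShardDirectory:
    """In-memory stand-in for the directory database mapping user ids to nodes."""

    def __init__(self, node_names):
        if not node_names:
            raise ValueError("at least one node is required")
        self._users_per_node = {name: 0 for name in node_names}
        self._node_for_user = {}

    def allocate(self, user_id):
        """Place a new user on the least-loaded node and remember the choice."""
        if user_id in self._node_for_user:
            return self._node_for_user[user_id]
        node = min(self._users_per_node, key=self._users_per_node.get)
        self._node_for_user[user_id] = node
        self._users_per_node[node] += 1
        return node

    def locate(self, user_id):
        """Return the node holding this user's data (KeyError if unknown)."""
        return self._node_for_user[user_id]

    def move(self, user_id, new_node):
        """Update the directory after a user's rows were copied to another node."""
        if new_node not in self._users_per_node:
            raise ValueError(f"unknown node: {new_node}")
        old_node = self._node_for_user[user_id]
        self._users_per_node[old_node] -= 1
        self._users_per_node[new_node] += 1
        self._node_for_user[user_id] = new_node


if __name__ == "__main__":
    directory = ShardDirectory(["node1", "node2"])
    for user_id in (101, 102, 103):
        print(user_id, "->", directory.allocate(user_id))
```

The application only ever asks the directory where a user lives; rebalancing then becomes a matter of copying the rows and calling `move`, without touching application code. The copy itself, and keeping it consistent while the user is active, is the hard part and is deliberately left out of this sketch.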

Although “sharding” is great for scaling on the database layer, there is no out of the box solution that will work with your application. You will need to invest quite a bit of time and effort in customizing the abstraction layer to make it work with your application. With this investment, you will eliminate many technical growing pains and let your business grow.

by Pavel Pragin / pavel@clearscale.net

ppragin MySQL

A Busy Day at the Office

March 27th, 2009

Today was a very busy day at the office; I felt like problems were coming at me from all directions. It’s often hard to keep focused on things that really matter with people and situations coming at you from every side. When all the hubbub dies down and all the fires are put out, you are still left with projects, planning, and relationship building that have to be done, but there are just not enough hours in the day. I am sure many of you can relate to my day at the office, but there is a better way, a more effective way, to manage your time that will bring you success in all areas of your life. Ask yourself these two questions and write down the answers:

  1. What is one thing you could do on a regular basis (you are not doing now) that would make a huge positive difference in your personal life?
  2. What in your professional life would bring similar results?

I have found that the key to effective time management is to organize and execute around priorities. Some priorities can vary from day to day, but others stem from a person’s character and are long term commitments to yourself and the people around you. Some examples of these are relationship building, mentoring, problem prevention, recognizing new opportunities, P/PC (production and production capability) balance, delegation, or any activity that brings long term results in business or personal life. Stephen Covey talks about this in his book The 7 Habits of Highly Effective People and provides an illustration called “The Time Management Matrix” that can help you manage yourself more effectively. I would like to give credit to Stephen Covey for creating this diagram.

[Figure: The Time Management Matrix]

This diagram contains 4 quadrants with different types of activities. Using this diagram you can identify where you are spending most of your time and make corrections based on the suggestions below.

II. Not Urgent Important

Quadrant two contains issues that often don’t require our immediate attention and can be put off for later, and unfortunately that’s often what happens. Important tasks such as relationship building, documentation, business development, employee mentoring, and training get left behind because they don’t produce immediate results or recognition. The other enemies of tasks in quadrant two are interruptions, crises, and busy work. Effective people spend as much time as possible in quadrant two working on not urgent but important issues. It’s rare to see senior managers running around like chickens with their heads cut off putting out fires, and that’s mostly because they spend their time on low urgency, high importance activities. Exercising prevention and anticipating problems can reduce interruptions and let you spend more time in this quadrant building your future success. Writing a blog like this is also a quadrant two activity!

I. Urgent Important

Quadrant one holds tasks we can’t say no to. These include such matters as deadlines, system outages, very angry customers, and things of that sort. Some of these issues are unavoidable and it’s normal to spend some portion of your time in this quadrant; however, focusing your efforts on quadrant two will shrink the amount of time you spend in quadrant one and make you more effective.

III. Urgent Not Important

Quadrant three includes issues that other people think are important, but are really not. Some examples are meetings about meetings, spam from coworkers, phone calls, or any problem that, if ignored, will not bring the company to a screeching halt and will be forgotten about in 2 hours. Don’t be acted upon; stay on track and out of this quadrant.

IV. Not Important Not Urgent

I know that none of us spend any time here, but for the sake of completeness…! News, YouTube, and other time wasters should be avoided at all costs and don’t contribute to your effectiveness. One of the reasons that people go to this quadrant is burnout from spending too much time in quadrant one.
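If you keep even a rough log of how you spend your day, sorting it into the four quadrants is mechanical. Below is a small sketch of that bookkeeping; the function names and the sample log are invented for illustration, and deciding whether something is truly "important" remains a judgment call that no script can make for you.

```python
# Quadrant labels keyed by (urgent, important).
QUADRANTS = {
    (True, True): "I",
    (False, True): "II",
    (True, False): "III",
    (False, False): "IV",
}


def quadrant(urgent, important):
    """Return the quadrant label for an activity."""
    return QUADRANTS[(bool(urgent), bool(important))]


def time_by_quadrant(activities):
    """Sum logged hours per quadrant from (hours, urgent, important) tuples."""
    totals = {label: 0.0 for label in ("I", "II", "III", "IV")}
    for hours, urgent, important in activities:
        totals[quadrant(urgent, important)] += hours
    return totals


# A made-up day: (hours, urgent?, important?)
sample_log = [
    (3, True, True),      # system outage
    (1.5, False, True),   # writing documentation
    (2, True, False),     # meeting about a meeting
    (0.5, False, False),  # browsing news
    (1, False, True),     # mentoring
]

if __name__ == "__main__":
    for label, hours in time_by_quadrant(sample_log).items():
        print(f"Quadrant {label}: {hours} h")
```

Seeing the totals side by side makes it obvious where the week actually went, which is the starting point for the corrections discussed next.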

The philosophy behind concentrating on the not-urgent but important issues in quadrant two is to be opportunity minded: starve problems and feed opportunities. Once you identify the shortcomings in your current self-management style, you will need to find a planning tool that will take you to quadrant two and keep you there, where you can be most effective. I recommend planning the whole week and scheduling quiet time to write the plan. The weekly plan can of course be adapted day by day as needed.

This weekly planning tool needs to have six qualities:

  • Coherence

Helps you keep your focus on your long-term goals, priorities, and plans, and makes sure they are consistent with the values and principles that are important to you.

  • Balance

Lets you balance different areas of your life such as work, family, health, and personal development.

  • Quadrant II

Keeps you focused on quadrant two activities and away from crises.

  • People

Nothing is set in stone. Allow for spending time with people even if it throws you off schedule. Don’t forget the human element.

  • Flexible

The tool needs to be customizable and made to serve your needs.

  • Portable

The tool needs to be portable so you can take it anywhere and review it easily.

A planning tool with these qualities forces you to look at your week not in terms of crises and action items that need to be addressed, but in terms of principle-centered, long-term goals. This is an effective way to manage yourself, because it’s easier to commit to and carry out tasks that are principle driven and that have personal value and meaning to you. You will be effective and feel good about what you do!

Let’s back up a little and, as an experiment, look at the answers you gave to the questions at the beginning of this blog. What quadrant do you think they fall into?

I bet it’s quadrant two! So what are you waiting for? Stop procrastinating and get to it.

by Pavel Pragin / pavel@clearscale.net

ppragin Management

High Availability without Expensive Hardware

March 27th, 2009

The Problem:
A customer called us in a state of panic, telling us that the MySQL server hosting the database for their application had a hardware failure and the data was not recoverable. We were asked to get them back online as soon as possible and to re-architect their environment to prevent such a disaster from happening again.
The Goal:
Engineer a cost-effective solution that will provide a highly available web server and database server architecture without using a load balancer.
The Solution:
A highly available web and database architecture utilizing open source software available for free on the Internet. This solution consists of DRBD (Distributed Replicated Block Device) software used for replicating block devices over the network, Heartbeat software, which provides death-of-node detection, and a MySQL Master-Slave configuration used to replicate databases to another host.
[Diagram: blog1]

Overview of the Solution:
This solution allows for a failure of one web and/or database server without interruption of service using DRBD/Heartbeat and a MySQL Master-Slave configuration. DRBD and Heartbeat allow you to configure one of the web servers to be in standby mode. No HTTP requests go to the Standby server. The purpose of the Standby server is to be available in case the Active server fails, and to take its place. The Standby server has identical system configurations, Apache configurations, and the “/var/www” directory holding the application files. Keeping the application files identical is accomplished using DRBD, which runs on both servers and replicates “/var/www” to the Standby server. Heartbeat is a piece of software that also runs on both servers and actively monitors that both the Active and the Standby servers are online. Whenever Heartbeat finds the Active server unreachable, the failover process is initiated and the Standby server is promoted to Active. During the failover process the Standby server takes over the IP address that serves HTTP traffic, mounts “/var/www/”, and starts Apache. Although DRBD and Heartbeat are different pieces of software, their configuration files are intertwined and they work together to make the failover process possible.
Up to this point we have been talking about highly available web servers, but what about making the database servers highly available? This is where Master-Slave replication can be used. By default, the currently active web server connects to the Master database server to execute queries. Whenever changes are made to the databases on the Master, they are replicated to the Slave shortly afterward (standard MySQL replication is asynchronous, so the Slave can lag slightly behind). In the event that the Master fails, the application’s logic will automatically route all the queries to the Slave, making the failure transparent.
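The application-side routing described above can be sketched roughly as follows. This is a minimal illustration, not the customer’s actual code: the host names “db-master” and “db-slave”, the `connect_with_failover` helper, and the `connect` callable are all assumptions made for the example, and a real application would pass in its MySQL driver’s connect function instead of the fake one used here.

```python
# Hypothetical sketch: try the Master first, fall back to the Slave.
# Host names and the connect() callable are illustrative assumptions.


class AllServersDown(Exception):
    """Raised when no database server accepts a connection."""


def connect_with_failover(connect, hosts):
    """Return (host, connection) for the first host that accepts a connection.

    `connect` is any callable that takes a host name and either returns a
    connection object or raises ConnectionError.
    """
    errors = {}
    for host in hosts:
        try:
            return host, connect(host)
        except ConnectionError as exc:
            errors[host] = exc  # remember why this host was skipped
    raise AllServersDown(errors)


def fake_connect(host):
    # Stand-in for a real driver: simulate a failed Master.
    if host == "db-master":
        raise ConnectionError("master unreachable")
    return "connection to " + host


host, conn = connect_with_failover(fake_connect, ["db-master", "db-slave"])
print(host, conn)
```

Keep in mind that once queries go to the Slave, any writes land only there, so the old Master has to be resynchronized before it can be put back in service.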
You may have noticed after looking at the diagram that the Slave database server and the Standby web server are on the same physical machine. This was done to eliminate the cost of adding a dedicated Slave database server, and to utilize the hardware for two purposes. If you are able to allocate an additional server for this solution, I would recommend a dedicated database Slave server.
How DRBD and Heartbeat work under the hood:
DRBD software runs on both web servers and replicates the underlying file system that is mounted on “/var/www” block by block over the network. This replication happens over a dedicated crossover cable hooked up to the “eth1” interfaces on both servers. This dedicated link provides a high performance connection between the systems, bypassing all network hardware, such as switches and routers, that can fail. During normal operation “/var/www” is mounted only on the Active server. This means that data is only written on the Active web server and then replicated to the Standby.
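To make the block-by-block idea concrete, here is a toy model in Python. It is only a sketch of the concept and has nothing to do with how DRBD is implemented: the `ReplicatedDevice` class, the 4096-byte block size, and the in-memory lists standing in for the two disks are all assumptions for illustration.

```python
# Toy model of block-level mirroring; not DRBD's real implementation.
BLOCK_SIZE = 4096  # assumed block size for the illustration


class ReplicatedDevice:
    """A primary block device whose writes are mirrored to a standby."""

    def __init__(self, num_blocks):
        empty = bytes(BLOCK_SIZE)
        self.primary = [empty] * num_blocks  # disk on the Active server
        self.standby = [empty] * num_blocks  # disk on the Standby server

    def write_block(self, index, data):
        if len(data) != BLOCK_SIZE:
            raise ValueError("data must be exactly one block")
        self.primary[index] = data  # local write on the Active server
        self._send_to_standby(index, data)

    def _send_to_standby(self, index, data):
        # Stands in for the transfer over the dedicated eth1 link.
        self.standby[index] = data

    def in_sync(self):
        return self.primary == self.standby


device = ReplicatedDevice(num_blocks=8)
device.write_block(3, b"a" * BLOCK_SIZE)
print(device.in_sync())
```

The point of the model is that replication happens below the file system: the Standby receives raw blocks, which is why it can mount “/var/www” later without any file-level copying.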
Heartbeat software runs on both web servers to monitor whether both nodes are online. The Heartbeat signal is sent over the crossover cable hooked up to the “eth1” interfaces on both servers. The crossover cable connection ensures that there are no latency spikes or network timeouts that could trigger a false failover when the signal does not reach one of the nodes.
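Death-of-node detection boils down to a timeout on the last signal received from the peer. The sketch below shows that idea only; it is not Heartbeat’s code, and the `HeartbeatMonitor` class, the `deadtime_seconds` parameter, and the choice to never declare a peer dead before its first signal are assumptions made for the example.

```python
# Minimal sketch of timeout-based death-of-node detection.
# Class and parameter names are illustrative, not Heartbeat's own.


class HeartbeatMonitor:
    def __init__(self, deadtime_seconds):
        self.deadtime = deadtime_seconds
        self.last_seen = None  # time of the last heartbeat from the peer

    def record_heartbeat(self, now):
        """Call whenever a heartbeat arrives over the crossover link."""
        self.last_seen = now

    def peer_is_dead(self, now):
        """True once the peer has been silent for longer than deadtime."""
        if self.last_seen is None:
            # Assumed policy: never fail over before the first heartbeat.
            return False
        return (now - self.last_seen) > self.deadtime


monitor = HeartbeatMonitor(deadtime_seconds=10)
monitor.record_heartbeat(now=100.0)
print(monitor.peer_is_dead(now=105.0), monitor.peer_is_dead(now=111.0))
```

Timestamps are passed in explicitly so the logic can be tested without waiting; a running daemon would read them from a monotonic clock.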
The Active and Standby servers each have one primary interface, “eth0”, which is bound to an IP address of “192.168.1.10” or “192.168.1.20”, is used for connecting to the servers over SSH, and can be considered the system’s administration interface. The Active server also has a virtual interface, “eth0:1”, which is bound to an IP address of “192.168.1.50” and is used for HTTP traffic. In most situations this IP address is translated to a public IP on the Internet using NAT and is associated with a domain name. The magic of Heartbeat takes place when the Active node becomes unreachable and the Standby node brings up the “eth0:1” interface and binds it to “192.168.1.50”, mounts “/var/www/”, and starts Apache. After this sequence of events is completed the Standby server takes the role of Active. This process takes seconds and is usually invisible to the users. After the failover event Heartbeat monitors the original Active server, and when it comes back online Heartbeat can be set up to automatically fail back to the original state.
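The takeover sequence can be modeled as three ordered steps applied to the Standby node. The following is a simulation under stated assumptions, not something that touches real interfaces: the `Node` class and `take_over` function are hypothetical, and assigning “192.168.1.20” to the Standby is a guess, since the text does not say which server holds which administration address.

```python
# Simulated failover sequence; nothing here configures a real system.
from dataclasses import dataclass, field

SERVICE_IP = "192.168.1.50"  # HTTP address from the text (eth0:1)


@dataclass
class Node:
    name: str
    admin_ip: str  # address on eth0, used for SSH
    addresses: set = field(default_factory=set)
    www_mounted: bool = False
    apache_running: bool = False

    def __post_init__(self):
        self.addresses.add(self.admin_ip)


def take_over(standby):
    """Promote the Standby node, following the order given in the text."""
    steps = []
    standby.addresses.add(SERVICE_IP)  # bring up eth0:1
    steps.append("bind " + SERVICE_IP + " to eth0:1")
    standby.www_mounted = True  # mount the replicated device
    steps.append("mount /var/www")
    standby.apache_running = True  # start serving HTTP
    steps.append("start Apache")
    return steps


# Assumption: the Standby is the server with admin address 192.168.1.20.
standby = Node(name="web2", admin_ip="192.168.1.20")
for step in take_over(standby):
    print(step)
```

The order matters: Apache is started last so that it never serves requests before the application files under “/var/www” are available.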


How MySQL Master-Slave works under the hood:

Replication enables data from a Master database server to be replicated to a Slave database server. The Master server records every query that changes data in a binary log file stored on its file system. The Slave then connects to the Master, retrieves the binary log, and replays those “Write” queries to keep itself in sync. This process can be very fast, depending on the type of queries. It provides a working copy of the database on another server that can be used in case the Master server fails. In this particular solution the application has logic that will automatically re-route queries to the working server in case one fails, providing high availability at the database layer.
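The log-and-replay mechanism can be illustrated with a small in-memory model. This is a conceptual sketch only: the `Master` and `Slave` classes, the key/value “database”, and the list used as a binary log are assumptions for the example and do not reflect MySQL’s actual log format or protocol.

```python
# Conceptual model of binary-log replication; not MySQL's real format.


class Master:
    def __init__(self):
        self.data = {}
        self.binlog = []  # ordered record of every write

    def write(self, key, value):
        self.data[key] = value
        self.binlog.append((key, value))  # only writes are logged

    def read(self, key):
        return self.data.get(key)  # reads leave the log untouched


class Slave:
    def __init__(self):
        self.data = {}
        self.position = 0  # index of the next log event to apply

    def catch_up(self, master):
        """Fetch and replay events the Slave has not applied yet."""
        new_events = master.binlog[self.position:]
        for key, value in new_events:
            self.data[key] = value
        self.position += len(new_events)
        return len(new_events)


master, slave = Master(), Slave()
master.write("user:1", "alice")
master.write("user:2", "bob")
master.read("user:1")
print(slave.catch_up(master), slave.data == master.data)
```

Tracking a position in the log is what lets the Slave resume after a disconnect: it asks only for events past the last one it applied.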

by Pavel Pragin / pavel@clearscale.net

ppragin HA and Scale