From 2010 to 2015, I was part of a real digital transformation of what I would call a stagnant IT organization. I will avoid names in these writings and focus on the transformation story itself.
In 2010, I was contacted by a former coworker regarding a very specific problem they were having with one of their web applications. The request was to see if I would come on part time to fix this issue. This coworker was now a functional area manager on a local IT contract.
At the time I was an experienced developer. I had started developing professionally in 2001 and over the years had maintained relationships from my former positions. This is an important point that I will go into in a follow-up article that defines characteristics of transformers.
I agreed to come on part time to help solve the immediate issue. This was an important decision because the symptom of this single application was actually a small piece of a much larger scale set of problems.
The organization I agreed to join was an IT contract for a civilian government agency in the US. The IT contract changes hands every 3 to 5 years, but by and large stays the same through those transitions. The staff also remains largely unchanged.
Let me lay down some numbers to give a bit of scale and bite to the challenge ahead.
- ~180 contractor staff (developers, system admins, network admins, database admins, IT security, project managers)
- 96 web applications
- 3 large scale simulation software projects
- Management of 3 campus data centers
- Management of the campus HPC clusters
Time to Work
When I joined, I immediately began work solving the tactical issue with the single application. It turned out to be a misunderstanding of variable scope and session affinity in ColdFusion (the web programming language of choice on the contract). However, this single problem exposed to me how rampant the quality issues were across the systems, applications and components our contract was responsible for.
The web applications team was 10 to 15 developers at any given time, with a core set that had been on the contract for 10 years or more. That meant the structure and processes were very defined, and the culture was very well established. This is the team that maintained the 96 web applications mentioned above. My focus began here because these applications were some of the most visible IT assets in the organization. They also depended on the underlying infrastructure services that our contract managed.
Solving the initial problem led to another, and then another. I approached the functional area manager responsible for the web applications team and laid out my concerns along with some suggestions for improvement. She agreed that there were major issues and that we needed to take action to make changes.
At the same time, I began to establish a trusted relationship with the program manager of the contract. Unknown to me at the time, he was an exceptional and trusting leader. This would play an enormous part in the story later.
The very first step I took was to inventory all of the web applications our contract was responsible for. I gathered information about their environments, who maintained them, what tooling was used and the current status of ongoing work (both enhancements and maintenance). This was a critical step in understanding our digital footprint for improvement. Prior to this, the organization had no mechanism for centralizing this information.
The second step I took was to meet with each of the developers individually to get an understanding of their work load, competencies, mode of operation, ambition, and suggestions for improvement.
Several things stood out immediately. The first was that there was no ongoing, consistent training path or career planning on this contract; the developers were essentially in the same place they had been when I worked with them back in the 2001 time frame. The second was that all of them had access to the production environments, and use of version control for the application software was inconsistent.
Choosing Consistent Tooling
Because we were having a run of production quality issues on these web applications that were both reliability and security related, there was tremendous bad press coming down on this team and our contract for inability to deliver quality software.
This led me to the third step which was to decide on a standard version control tool. I chose Git because of its increasing popularity and the pull model which I thought was dramatically better than the push model of CVS or Subversion (which were the tools of choice on the contract for the few applications that were in version control).
I also chose to focus on adopting some of the Atlassian tooling (Jira for ticketing, Confluence for documentation, Fisheye for code visibility, and Crucible for code reviews).
The fourth step that I took was to develop a method to promote code through our 4 web lifecycle environments, DEV, TEST, SAT (system acceptance testing), and PROD. All of these environments except DEV were considered customer owned. This meant that those environments must be extremely reliable and consistent. I knew this would never happen if our developers had direct access to those customer environments (which they all did).
A very common method of development at the time on our contract was to copy a code file in production to the same name with something to identify the file change like a date or developer initials.
Yes, you read that right.
Our developers were making live coding changes in our production environments.
The worst offender of this was the campus intranet portal. This site was the first thing campus employees hit in the morning and the last thing they hit before they left to go home. That site actually didn't even have any environments other than production!
To compound the issues, it was extremely difficult to get access to infrastructure resources to stand up anything new. I made a few feeble attempts to get specific individual tools stood up like GitLab and some of the Atlassian suite. There was tremendous push back. What I could get was a single Linux virtual machine. I took it.
This became a box named devtools, and because I convinced the infrastructure team to give me root on this VM, I was able to install whatever I wanted. I installed Git and several of the Atlassian tools on this single VM. The server was only accessible internally and was not a security threat to our production environment.
The next thing I did was develop a process for getting all of our web applications into Git repositories. I decided on a branch model which matched our environments, dev, test, sat and master. I chose master to represent production. There are a million branching methodologies out there. At this point in time, anything was better than nothing. This choice though allowed me to use Git hooks to launch a code promotion script (which lived and executed on the devtools vm). This meant a code commit to the test branch would automatically promote code to the TEST environment and the same for SAT and, eventually, PROD.
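The branch-to-environment promotion described above can be sketched roughly like this; the promote.sh helper, its path, and the exact wiring are assumptions for illustration, not the actual script that ran on our devtools VM:

```shell
#!/bin/sh
# Sketch of a server-side post-receive hook on the devtools VM.
# /opt/devtools/promote.sh is a hypothetical helper, not our real script.

# Map a pushed branch to its target lifecycle environment; branches with
# no mapping (dev, feature branches) print nothing and trigger no promotion.
env_for_branch() {
  case "$1" in
    test)   echo TEST ;;
    sat)    echo SAT ;;
    master) echo PROD ;;
  esac
}

# Git feeds the hook one "<old-sha> <new-sha> <refname>" line per pushed ref;
# a real hook would end by piping stdin through a loop like:
#   while read -r _old _new ref; do
#     target=$(env_for_branch "${ref#refs/heads/}")
#     [ -n "$target" ] && /opt/devtools/promote.sh "$ref" "$target"
#   done
```

The nice property of this mapping is that a push becomes the only interface developers need: the branch itself declares the target environment.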
I tested and validated this tooling and process on one of our applications and documented how to do everything.
With a validated set of tools and the ability to promote code between the environments, I was ready to move. Change is never easy, and this one was going to be rough. My program manager wanted me to rip the bandage off immediately; however, I convinced him to take this approach one application at a time. This would contain the blast radius.
The very first action I took, though, was to revoke all developer access to our production environments. While this was painful and disruptive, it was absolutely necessary for all that would follow.
With my complete application inventory, I had a priority listing of the application adoption of the new tooling. I had already created or imported the applications into Git. This meant that immediately, if developers wanted to make code changes, they could. Even if the code promotion tooling wasn't applied to their given application for all environments, they could still get code into DEV, TEST and SAT without anyone's assistance. And I knew that eventually these environments would become consistent. For PROD promotion, they had to submit a ticket to our infrastructure team and code would be moved manually into production.
The first few applications went a little slow and bumpy, but they helped us improve and clean up the process, as well as the documentation. It also allowed us to show momentum and success to leadership.
We also built training material off of the first few application moves that allowed us to teach all of the other developers Git, the Atlassian tools and our home grown code promotion mechanism.
I worked with the developers for each of the 96 applications and we successfully migrated 100% of our web applications into this process within 6 months from the initial concept.
As for promotion to production, I explained our promotion process to the infrastructure leadership (who controlled production). I showed that we could automate and control exactly what code got promoted to PROD. At first they insisted that a human must move the code to production. It could be a script, but a human had to take action to run it. I worked with one of our system administrators to create a script that would move code from a staging location to production. After a month and over a thousand human-initiated executions of that script, the infrastructure lead called me up and said we could have our promotion "button" back. I immediately automated the final push. We now had manual involvement only in our DEV environment.
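That human-run promotion step was essentially a controlled copy from a staging location into the production web root. A minimal sketch, with every path and name hypothetical:

```shell
#!/bin/sh
# Sketch of a manual PROD promotion script; all paths are illustrative.
STAGING=${STAGING:-/stage/webapps}          # staging drop fed by the Git pipeline
PRODROOT=${PRODROOT:-/var/www/apps}         # production web root
LOGFILE=${LOGFILE:-/var/log/promotions.log} # who promoted what, and when

# Copy one application's vetted staging tree into production, replacing the
# production copy wholesale so deletions in staging propagate too.
promote_app() {
  app="$1"
  [ -d "$STAGING/$app" ] || { echo "no staged build for $app" >&2; return 1; }
  rm -rf "${PRODROOT:?}/$app"
  cp -R "$STAGING/$app" "$PRODROOT/$app"
  echo "$(date -u +%FT%TZ) ${USER:-unknown} promoted $app" >> "$LOGFILE"
}
```

A simple append-only log like this is one way to preserve the audit trail the infrastructure team cared about while building confidence toward automating the final push.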
It was interesting that part of the process for adopting each application was revoking developer access to TEST and SAT environments. This meant that in the end the developers only had access to DEV. The initial revolt against this revocation of privilege was quelled by seeing the simplicity and consistency of moving code through controlled version control and promotion.
A side effect of this was that all the calls or emails we used to get to "just make a quick change" were no longer actionable outside of the process. Imagine the effect this would have on overall software quality. There was now visibility into all code changes via Git and requests were being tracked via Jira. Code changes were tied to those requests via commit messages. Our developers could tell the customer that they literally did not have access to make the quick changes and that they would have to go into the backlog of working requests. Obviously we could reprioritize if needed, but it freed us from being distracted or making mistakes in production applications.
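Tying code changes to Jira requests via commit messages can even be enforced mechanically with a commit-msg hook. This is a sketch of the idea, not necessarily how we wired it, and the WEB-123 key format is just an example:

```shell
#!/bin/sh
# Sketch of a commit-msg hook that requires a Jira ticket reference.
# The project key format (e.g. WEB-123) is hypothetical.

# True when the message contains a Jira-style key such as WEB-123.
has_ticket_key() {
  printf '%s\n' "$1" | grep -qE '[A-Z][A-Z0-9]+-[0-9]+'
}

# In .git/hooks/commit-msg, Git passes the message file as $1; a real hook
# would end with something like:
#   has_ticket_key "$(cat "$1")" || { echo "reference a Jira ticket" >&2; exit 1; }
```

With a check like this in place, the Git history itself guarantees that every change traces back to a tracked request.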
One other key choice I made was to befriend the IT security team. As I was working through this tooling, I started making regular appearances in their office. I would just start whiteboarding a problem I was working through. It would inevitably grab someone's attention, and they would begin helping me work through all the security implications. The important thing was that it shifted their role from a gate at the end of the process to an integral part of the design and architecture. This meant that approvals at the end were simply dotting i's and crossing t's. It also meant they started being a "Yes, but" team instead of a "No" team.
We even created a fifth environment named QA that we utilized to mirror PROD. This allowed IT security and operations to run all of their intrusive scans and test anything like patches in this QA environment before anything was applied or changed in production. No more successful SQL injection attacks that corrupt databases by our own IT security team!!! These would now happen in QA, we would figure out the corrective action, validate, test, and then promote to production.
A year later, our web development team, which had a terrible reputation for delivering crappy software, was a model citizen for application development across the entire organization. 100% of our web applications had their complete history tracked in modern version control, documentation was being written and centrally stored, and code promotion was completely consistent.
We started sharing and opening up our tooling and model to other software development teams. We spun up lunch and learn sessions for developers across the campus to attend.
Most of all, we became a trusted team for our customer that delivered reliable and consistent quality web applications.
Expand on Success
The skills and lessons I learned from transforming the web application team caught the attention of my program manager and the customer leadership. Remember earlier when I talked about my program manager's exceptional, trusting leadership? I had earned that trust, and it opened the door for me to expand my role in his organization.
He started introducing me to other program teams on the contract that were having issues. This ultimately led to us creating a specific position as Chief Technologist on the contract (a position that did not officially exist previously). I was invited to be a part of the weekly senior leadership meetings and began digging in all across our contract to completely transform the way we did business. The government customer ultimately wrote my created role into the RFP for the next iteration of the IT contract.
My plan is to write a series of articles that walk through this entire story. I wanted to start with the baseline so that all that follows makes sense.
Wherever you are in your transformation, I hope that my stories can encourage you to continue forward movement.