The Search pipeline consisted of at least a thousand services. To deploy something of that scale, you have to employ robust tools and solid processes, especially when the team consists of several thousand engineers. Normally, each service is managed by a team and resides in its own source control repository. Sometimes services depend on base services, so you would have to do the forward/reverse integrations before deployment.
Prior to deployment, you would need a deployment plan (generally created at the beginning of the feature sprint). The service would be developed locally but deployed to a dev environment (manually, back in the day). This is where the dev team would get to test it. There would be a test cycle and a bug-fixing phase to go through, followed by green-lighting meetings with the service team, partners, and stakeholders. In these meetings, they discuss potential risks, dependencies, delays on other teams, fallback plans, etc.
When everything is good, all the dependent services get pushed to their respective test environments, where they undergo an automated test run that files bugs automatically. This leads to a bug-triaging process, followed by bug fixes and redeployment. Everything has to pass here before it can be promoted to the integration environment, where all the dependencies are tested as a whole.
Once everything passes, the services are moved to staging (aka pre-production), where the service/feature is tested by the entire team, generally through a scheduled bug-bash.
Once deployment is done, you have to ensure that logs, monitors, and alerts are all functional. We also had to eyeball the service/UI to ensure everything looked good.
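Those post-deployment sanity checks can be partly scripted. A minimal sketch in Python, assuming each service exposes HTTP health/metrics endpoints (the URLs here are hypothetical, not Bing's actual infrastructure):

```python
import urllib.request

# Hypothetical endpoints -- a real service would expose its own
# health and metrics URLs.
CHECKS = {
    "service health": "https://search-frontend.example.com/health",
    "metrics emitted": "https://monitor.example.com/api/counters/search-frontend",
}

def smoke_check(checks: dict) -> dict:
    """Return a pass/fail map; any failure means the deployment needs attention."""
    results = {}
    for name, url in checks.items():
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                results[name] = resp.status == 200
        except OSError:
            # DNS failures, timeouts, and connection errors all count as failures.
            results[name] = False
    return results
```

A script like this only covers the mechanical part; the manual "eyeball the UI" step still catches rendering and relevance issues that a status code can't.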
Also, sometimes new features are flighted (exposed to ~3% of the userbase). This allows the team to ensure that there are no catastrophic failures. If something were to go wrong, it doesn't affect the entire userbase.
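Flighting like this is typically implemented by deterministically bucketing users, so the same user always sees the same experience and the exposed fraction stays near the target. A rough sketch (the hashing scheme is illustrative, not Bing's actual mechanism):

```python
import hashlib

FLIGHT_PERCENT = 3  # expose the feature to ~3% of the userbase

def in_flight(user_id: str, feature: str, percent: int = FLIGHT_PERCENT) -> bool:
    """Deterministically bucket a user into one of 100 buckets.

    Hashing feature name + user id means the same user always gets the
    same answer for a given feature, and different features flight
    independent slices of users.
    """
    digest = hashlib.sha256(f"{feature}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < percent
```

Because the bucketing is deterministic, a user who is in the flight stays in it across requests, which keeps the experience consistent and makes the exposed cohort easy to analyze.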
After deployment, the PM/Testers monitor that service for a few days. They will look at usage analytics, precision-recall measurements in some cases, and other statistics. They might decide that the feature is not used by end-users and make a decision to pull it back.
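The precision-recall measurements mentioned above compare, per query, the set of results the service returned against the set judged relevant. A small illustrative helper:

```python
def precision_recall(relevant: set, retrieved: set) -> tuple:
    """Precision: what fraction of retrieved results were relevant.
    Recall: what fraction of relevant results were retrieved."""
    tp = len(relevant & retrieved)  # true positives
    precision = tp / len(retrieved) if retrieved else 0.0
    recall = tp / len(relevant) if relevant else 0.0
    return precision, recall
```

For example, if 4 documents are relevant and the service returns 3 results of which 2 are relevant, precision is 2/3 and recall is 1/2; aggregated over many queries, these numbers tell the team whether a feature is actually improving result quality.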
This process was mainstream before I left Bing in 2012. From what I've heard, things are different now.
Team sizes vary. Usually, a lead may have up to 5-6 developers under them. A development manager might have 2-5 leads and a couple of Senior ICs. Principal Development Managers may have up to ~32 people on their team. People tend to wear the number of subordinates as a badge of honor!
There are a few important roles in the organization. You have Software Development Engineers (SDE) and Software Development Engineers in Test (SDET or SDE/T); I think about 90% of tester roles were eliminated from Bing around 2012, as the role had become a joke! Then there are Software Test Engineers (STE), who are basically manual testers for certain tasks. You might get someone watching images all day to identify porn, for instance, or matching redlines, or simply executing scenarios manually before they're automated.
Then, there are Program Managers who manage a feature on a team and there are Release Program Managers who are responsible for the coordination of certain pipelines (a bunch of feature teams pretty much). They will communicate delays or changes, etc. across the larger feature teams and ensure everyone is on the same page.
There's also a Product Manager who is the liaison between the business side (Biz Dev Manager) and the engineering side (Program Manager). It's all a well-oiled machine and a marvel to observe; everyone has their own little thing to do. However, these processes and red tape kill agility and innovation at times.
Every team is different. Most teams in Bing followed a Scrum/sprint style. Some teams used a Kanban board of post-its, whereas others used Microsoft Project.
There's a lot more to it with source code branching that coincides with the teams, stages, features, versions, etc. There's a live branch for quick bug fixes. There's a hotfix branch. There's an out-of-band deployment for breaking changes, etc.
In the past, most teams released every 3-6 months, and that was considered fast. I was on a team that changed things and started releasing every 2-4 weeks, and that stirred the pot. I've heard that they might be doing daily releases now; I'm not entirely sure. I also forgot to mention: our build initially took about 18 hours. When I joined, it was down to ~10 hours. Over time, things got better because we had teams dedicated to optimizing the build process, and I think we got it down to as low as 30 minutes for some teams. Also, deployments are isolated as much as possible. Sometimes you deploy one service, whereas other times you may have to deploy 5 services or wait for other teams to deploy. That schedule is also managed by release managers.
Overall, it is a nightmare when you think about deploying across data centers, maintaining version compatibility, being able to revert, monitoring all of it, etc. However, tooling has been key; that, along with the effort of some brilliant people, has constantly improved the process.
How long did it take for each product to go to production?
And what Agile framework do the folks there use?
For such distributed systems, how does team management work? Small teams, or something else?
What maintenance model does your team follow? Sprints and regular releases, or do the managers hand requirements to the dev team(s) as needed?
Woohoo... a heck of a comment.
Thank you for the well-detailed comment, I really appreciate it :)
I can see now why such products barely have bugs for end users!