2026-02-20

AB Testing != Feature Flagging

I hear this a lot. The two seem the same: both flip experiences on for users, both need a system to control that, and both need some counting and analytics.

But they have fundamentally different goals. Even if teams can use the same codebase and infrastructure to implement and maintain both systems, they need to be considered differently: turning one on, monitoring it, and evaluating success each involve a different interface and thought process.

The basics

Websites will turn on different experiences for different groups of users in order to

  • measure customer experience
  • evaluate internal systems 

Essentially to ask "Is this new setup better than what we had before?" 

Requirements

Let's start here with one system at a time, and then investigate any overlapping requirements.

The actual requirement is as simple as "Is this change better? Should it always be used?"

Feature Flags

Feature Flags are internally focused. Their requirements come from engineering teams, for engineering teams; they are implemented and run by engineering teams, and typically focus on three use cases:

  • Does this change make the infrastructure better?
  • Does this change meet the requirements asked for?
  • If this breaks, can we rapidly turn it off?

I don't think I've ever seen a feature flag that doesn't fit within these cases.
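At their simplest, all three cases reduce to a named switch that code checks at runtime. A minimal sketch, where the flag names and in-memory storage are purely illustrative and not from any particular vendor or library:

```python
# Illustrative flag store; real systems back this with a config
# service or database so flags can flip without a deploy.
FLAGS = {
    "new-checkout-flow": {"enabled": True},
    "regional-cache": {"enabled": False},
}

def is_enabled(flag_name: str) -> bool:
    """Return True only if the named flag exists and is switched on."""
    return FLAGS.get(flag_name, {}).get("enabled", False)

if is_enabled("new-checkout-flow"):
    pass  # serve the new experience
else:
    pass  # serve the old experience
```

An unknown flag defaulting to off is a deliberate choice here: a missing or misspelled flag name should fail safe to the old experience.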

Latency & Scaling

If we change the underlying infrastructure like this, does the system behave the same but cheaper, faster, or more robustly?

Such as

  • Can we handle the same traffic at the same speed with 1 less node?
  • If we reorganize the JavaScript libraries, will loading be faster, with similar functionality?
  • Will regionalized deploys work as expected?

Often individual users are not split here, but a % of traffic to specific endpoints may be. So a single user may experience different infrastructure over a session or day, with the goal being that they don't notice the changes.

Most changes here involve Engineers looking for large swings in specific metrics, with no change to other metrics. Systems behave consistently over time, so a high-quality decision can be made quickly.
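A per-request traffic split like this can be sketched as follows; the canary fraction and route names are made up for illustration:

```python
import random

def route_request(canary_fraction: float = 0.10) -> str:
    """Route a single request: a fraction of traffic hits the new
    infrastructure, the rest hits the old. The split is per request,
    not per user, so one user may touch both over a session."""
    return "new-infra" if random.random() < canary_fraction else "old-infra"
```

Because assignment happens per request, there is no user tracking to maintain; the trade-off is exactly the one described above, where a single user can bounce between infrastructures.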

QA / Gradual Rollout

  • Can we roll out features all the way to production, and let QA/Stakeholders test at their leisure?
  • Can we roll out new features to a subset of users?

Teams will often use Feature Flags to restrict who sees an experience: they can move fast on implementation while other teams move at their own pace, and gradually roll out changes to ensure that systems, including human systems such as customer support, scale properly.

In this case, specific users or groups of users must be tracked to be allowed to see the changes.
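One way to sketch that per-user tracking, with a hypothetical allowlist for QA/stakeholders and group-based gradual rollout (all names here are invented for illustration):

```python
def sees_feature(user_id: str,
                 allowlist: set[str],
                 rollout_groups: set[str],
                 user_group: str) -> bool:
    """A user sees the feature if they are explicitly allowlisted
    (e.g. QA or stakeholders testing in production) or belong to
    a group included in the gradual rollout."""
    return user_id in allowlist or user_group in rollout_groups
```

The rollout then widens by adding groups to `rollout_groups` over time, rather than by changing code.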

Rapid Turn Off

This is used in situations where a breaking experience or vendor is known to severely impact a system or the user experience. Turning the feature off usually yields a degraded experience, but not a catastrophic one.

For example, a payment processor may be experiencing stability issues. It's bad to not be able to accept payment, but worse if the payment hangs and users don't know whether their order has gone through. Having a switch that allows payments to be turned off can be a useful stop-gap while a more permanent solution is found.

AB Testing

AB Testing is typically focused externally, but can be turned internally as well. These tests come from a stakeholder trying to move some business metric. Since these metrics are usually highly dependent on who uses the system and what their intent is, they need statistical measurement to evaluate whether changes help or hurt a metric.

Groups of users are given different experiences in a highly controlled and tracked manner to allow for

  • Statistical Design and Analysis
  • High-Quality Decisions

Sophisticated AB Testing is extremely concerned with

  • Who sees the new experience?
    • Very specific groups will often be targeted.
  • When did they see the new experience?
    • Data from before they see the change is treated differently from data after.

Teams will tune experiences to specific groups of users in order to increase metrics reliably and systematically. Without high confidence in the underlying system, stakeholders will be forced to make low quality decisions.
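The controlled, tracked assignment described above is commonly implemented by hashing a stable user id, so a user keeps the same variant for the life of an experiment. This is one common approach, not the only one:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: tuple[str, ...] = ("A", "B")) -> str:
    """Deterministically assign a user to a variant. Hashing the
    user id together with the experiment name keeps assignment
    stable within an experiment, and independent across experiments."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]
```

Determinism matters here: the same user must see the same experience on every visit, and the analyst must be able to reconstruct who was in which group without a lookup table.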

Feature Flagging compared to AB Testing

Commonalities here are

  • Specific users see specific experiences and are locked into that experience

And differences are

  • % of traffic is sometimes acceptable for Feature Flagging, but never for AB Testing
  • Feature Flagging doesn't typically care about types of users seeing an experience, but AB Testing often does
  • Feature Flagging looks for unchanged metrics or metrics with huge, obvious swings, whereas AB Testing teases out 1% and 2% changes, tuning an experience for the better.
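Teasing out 1% and 2% changes means statistical analysis rather than eyeballing a dashboard. A minimal sketch of one standard approach, a pooled two-proportion z-test on conversion rates (numbers below are invented):

```python
import math

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """z-statistic comparing two conversion rates using a pooled
    standard error; |z| > 1.96 is significant at the 5% level."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se
```

With 10% vs 12% conversion on 1,000 users per arm, z comes out around 1.43, below 1.96, which is exactly why small lifts demand large samples and careful design rather than a quick glance at the numbers.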

Why does this conversation come up so much?

The underlying system and code are often the same for the two. Vendors sell both together, and programmers will often toggle between the two kinds of experience with identical lines of code.

The difference primarily exists in the instigation and analysis, both of which sit outside standard engineering responsibilities. So Engineering teams will view these systems as the same, and although they are not wrong, this view can introduce subtle bugs and a lack of confidence in AB Testing.

Alright then, what do we do?

Set expectations. Stakeholders and analysts can use consistent vocabulary and templates for documentation and ticketing. This communicates the differences to Engineering teams.

Engineering teams can solicit help from analysts on analyzing feature flags. Engineers can also be stakeholders eking out 1% and 2% changes via AB testing.

This is entirely a problem of

  • Communication
  • Documentation
  • Team Cohesion

If you have a single system that runs both, communicate and discuss this with stakeholders, analysts and engineers at every opportunity. Make sure everyone understands this one system may have different names and goals, but serves all equitably.

Conclusion

If you mean AB test, say AB test. If you mean Feature Flag, say Feature Flag.

Set goals beforehand, set expectations before a change, and communicate what success looks like early and often.

These are clearly different, with different goals, even if the underlying code is the same.