For years, Microsoft’s experimentation platform (ExP) has been the backbone of running A/B experiments at a global scale, analyzing results and enabling data-driven decisions for Microsoft products used worldwide. ExP started at a time when global data could be compiled and analyzed directly, allowing for comprehensive global insights. The game changed with emerging data compliance standards, and in particular with Microsoft’s EU Data Boundary (EUDB) pledge of May 6th, 2021: a commitment to ensure that EU customer data is processed and stored exclusively within the European Union. This commitment applies across all of Microsoft’s core cloud services (Azure, Microsoft 365, and Dynamics 365). While a milestone for data sovereignty, the pledge posed unique hurdles for ExP: how to maintain the speed and accuracy of experimentation in a world where data must respect strict boundaries?
Challenges and Solution
To immediately align with data boundaries, Microsoft products generated separate analyses for the EU and all other regions, collectively referred to as RoW (rest of world). Each analysis produces a scorecard that includes metrics and statistical tests for A/B experiments. For example, calculating the global Click-Through Rate (CTR) metric previously involved extracting data from a single dataset, summing up total clicks and impressions, and directly computing the CTR. With the introduction of data boundaries, however, CTRs for the EU and RoW must be calculated separately, and the global CTR cannot be derived by simply averaging the regional CTRs.
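To make the pitfall concrete, here is a toy example with made-up numbers: because regional traffic volumes differ, the unweighted average of regional CTRs can be far from the true global CTR, which weights each region by its impressions.

```python
# Hypothetical regional totals; the numbers are illustrative only.
eu = {"clicks": 30, "impressions": 1_000}      # EU CTR  = 3.0%
row = {"clicks": 450, "impressions": 9_000}    # RoW CTR = 5.0%

# Naive average of the two regional CTRs ignores traffic volume.
naive_avg = (eu["clicks"] / eu["impressions"] + row["clicks"] / row["impressions"]) / 2

# True global CTR merges the underlying totals before dividing.
global_ctr = (eu["clicks"] + row["clicks"]) / (eu["impressions"] + row["impressions"])

print(f"naive average of regional CTRs: {naive_avg:.3%}")   # 4.000%
print(f"true global CTR:                {global_ctr:.3%}")  # 4.800%
```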
This posed a critical question for ExP and Microsoft products to answer: how should we run experiments and understand the overall impact within the EUDB constraints? Several solutions were proposed:
1. Switch to managing traffic exposure independently per region, and analyze data independently in each region.
2. Continue managing traffic exposure globally, and generate analysis results only for the RoW region, since it has a higher traffic volume than the EU.
3. Continue managing traffic exposure globally, create scorecards separately for the EU and RoW, and then apply a meta-analysis that merges the regional pre-aggregated results into global analysis results.
ExP decided to go with the last option. The main considerations include, but are not limited to, the following:
- Most Microsoft product features are shipped globally, except for localization features.
- Lack of global analysis results would expose products to a high risk of incorrect ship decisions. Relying on results from one regional analysis for global decision making would not only reduce metric power but also increase the chance of shipping bugs or introducing biased code to other regions.
- If ExP supported independent regional analyses for a candidate global feature, without providing an automated method for aggregating results from different regions, it would put undue onus on product teams to carefully account for distinct populations and arrive at a singular decision. In contrast, a global analysis ensures the full impact of the treatment is front-and-center in the decision-making process.
- More data boundaries may be introduced in the future: while options 1 and 2 were acceptable for some products in the short term, the expectation remained that further regional fragmentation would carve away at the ability to assess global effects. Neither reading regional analysis results individually nor using a single regional analysis result for global ship decisions is sustainable in that context. ExP should provide a solution that scales to more data boundaries.
Design of Scorecard Merging Solution
Design principles
Scorecard merging was the most desirable solution, and also the most challenging. The end-to-end implementation would involve changes in almost every component of the ExP platform. While designing the solution, we tried to follow a few key principles.
- Trustworthiness. This is the most critical principle of the solution. Product teams rely on merged results to make business decisions, so we needed to make sure the merge algorithm is correct and trustworthy.
- Full coverage. Our solution should be applicable to all ExP customers at Microsoft regardless of their compute fabric or source data format. The merging algorithm must cover all types of metrics that ExP supports.
- Scalability and generalizability. The scorecard merging engine should be able to work at a large scale and should not introduce much delay to the scorecard availability. The solution must be easily adopted by product teams and applied to new data boundary scenarios in the future.
- Consistency. The ExP platform user experience should align with the existing experience, minimizing any added cognitive load for product teams.
Design overview
The ExP scorecard infrastructure is inherently complex, and the introduction of scorecard merging necessitated modifications across all modules of the process. Below is an illustration of the workflow and its key components. Please note that this overview does not encompass the comprehensive details of all involved modules and processes.
Complexity of scorecard merge module
The scorecard merge module in the workflow is the statistics engine. It cannot simply average values from the individual scorecards; each type of metric requires different merge logic to get the right point estimate and variance.
For instance, consider a dataset containing \(N\) Randomization Units (RUs). For \(i=1,\cdots,N\), the \(i\)th RU has observation \(Y_i\). The RU-level average metric we are measuring is \(\bar{Y} = \frac{\sum_i Y_i}{N}\), and the standard error \(\sigma_{\bar{Y}}\) follows from the standard deviation \(\sigma_Y\) as \(\sigma_{\bar{Y}}=\frac{\sigma_Y}{\sqrt{N}}\). Assuming there is no overlap among the datasets, merging the results requires three intermediate statistics \((N_\tau, \bar{Y}_\tau, \sigma_{Y,\tau})\) from each dataset \(\tau \in \{1, \cdots, t\}\): the total number of RUs is \(N_{tot}=\sum_{\tau} N_{\tau}\), the merged average is \(\bar{Y}_{tot}=\frac{1}{N_{tot}} \sum_{\tau} N_{\tau} \bar{Y}_{\tau}\), and the merged variance is
\( \sigma_{Y,tot}^{2} = \frac{1}{N_{tot}-1} \left\{ \sum_{\tau} \left( \max(N_{\tau}-1,\, 0)\, \sigma_{Y,\tau}^{2} + N_{\tau} \bar{Y}_{\tau}^{2} \right) - N_{tot} \bar{Y}_{tot}^{2} \right\} \).
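For illustration, the following minimal Python sketch implements this merge for RU-level average metrics. The function name and data layout are ours, not ExP’s production interface:

```python
import math

def merge_average_metric(parts):
    """Merge per-dataset intermediate statistics into global ones.

    parts: list of (N, mean, std) tuples, one per disjoint dataset.
    Returns (N_tot, mean_tot, std_tot) for the combined data.
    """
    n_tot = sum(n for n, _, _ in parts)
    mean_tot = sum(n * m for n, m, _ in parts) / n_tot
    # Pooled sample variance: recover each dataset's sum of squares from
    # its (N, mean, std), add them up, then re-center around the global mean.
    ss = sum(max(n - 1, 0) * s**2 + n * m**2 for n, m, s in parts)
    var_tot = (ss - n_tot * mean_tot**2) / (n_tot - 1) if n_tot > 1 else 0.0
    return n_tot, mean_tot, math.sqrt(max(var_tot, 0.0))

# Example: two disjoint regional datasets yield the same statistics as
# computing them directly on the concatenated global data.
n, mean, std = merge_average_metric([(1000, 0.42, 0.11), (250, 0.39, 0.09)])
std_err = std / math.sqrt(n)   # standard error of the merged mean
```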
If the metric is calculated at a level more granular than the RU, e.g., the experiment is randomized by user but the metric is the average number of clicks per session, we can no longer apply the formula above directly, because sessions within the same user are not independent and identically distributed. The Delta method is applied both in the individual scorecard calculation and in the merging process, and the number of intermediate statistics from each dataset increases from 3 to 7, as the covariance between quantities is also required.
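The exact set of 7 intermediate statistics in ExP’s engine is not spelled out here, so the sketch below makes an illustrative choice, using a minimal subset of six per dataset: the RU count plus the per-RU means, variances, and covariance of the numerator (clicks) and denominator (sessions). Merging pools these statistics the same way as above and then applies the first-order Delta method to the merged ratio:

```python
import math

def merge_ratio_metric(parts):
    """Merge ratio-of-sums metrics (e.g., clicks per session, randomized by user).

    parts: list of per-dataset intermediate statistics
           (N, mean_x, mean_y, var_x, var_y, cov_xy),
           where per RU, x = total clicks and y = total sessions.
    Returns the merged point estimate and its Delta-method standard error.
    """
    n = sum(p[0] for p in parts)
    mx = sum(p[0] * p[1] for p in parts) / n
    my = sum(p[0] * p[2] for p in parts) / n

    def pool(idx_a, idx_b, idx_stat):
        # Pool a (co)variance like the average-metric merge: rebuild
        # per-dataset cross-products, then re-center around global means.
        cp = sum(max(p[0] - 1, 0) * p[idx_stat]
                 + p[0] * p[idx_a] * p[idx_b] for p in parts)
        ga = sum(p[0] * p[idx_a] for p in parts) / n
        gb = sum(p[0] * p[idx_b] for p in parts) / n
        return (cp - n * ga * gb) / (n - 1)

    vx, vy, cxy = pool(1, 1, 3), pool(2, 2, 4), pool(1, 2, 5)

    estimate = mx / my
    # First-order Delta method for the variance of a ratio of means.
    var = (vx / my**2 - 2 * mx * cxy / my**3 + mx**2 * vy / my**4) / n
    return estimate, math.sqrt(max(var, 0.0))
```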
If the metric is a percentile metric, the methodologies used for average-metric merging can no longer provide the right percentile point estimate and variance. ExP uses the Outer CI method [1] for estimating the standard error. The final statistics cannot be merged; however, the intermediate frequency distributions required for Outer CI are mergeable. This also required implementing additional temporary storage for the intermediate frequencies, as regional jobs can be submitted and finished up to a day apart.
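The full Outer CI derivation is in [1]; the sketch below only illustrates the mergeable part: regional frequency tables are summed, and the percentile point estimate and rank-based confidence bounds are read off the merged cumulative counts. The exact-value frequency table and the rank handling here are simplified assumptions:

```python
import math
from collections import Counter

def merge_frequency_tables(tables):
    """Sum per-region frequency distributions (value -> count) into one."""
    merged = Counter()
    for table in tables:
        merged.update(table)
    return merged

def value_at_rank(freq, rank):
    """Walk the sorted values until the cumulative count reaches `rank`."""
    cum = 0
    for value in sorted(freq):
        cum += freq[value]
        if cum >= rank:
            return value
    return max(freq)

def percentile_with_outer_ci(freq, p, z=1.96):
    """Percentile point estimate plus rank-based confidence bounds,
    in the spirit of the Outer CI method of [1]."""
    n = sum(freq.values())
    half_width = z * math.sqrt(n * p * (1 - p))
    point = value_at_rank(freq, p * n)
    lo = value_at_rank(freq, max(p * n - half_width, 1))
    hi = value_at_rank(freq, min(p * n + half_width, n))
    return point, lo, hi

# Regional histograms merge losslessly; the percentile and its bounds are
# then computed once on the merged distribution.
eu = Counter({10: 400, 20: 500, 50: 100})
row = Counter({10: 3000, 20: 4500, 50: 1500})
print(percentile_with_outer_ci(merge_frequency_tables([eu, row]), p=0.95))
```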
As we can see, there is significant complexity in building the scorecard merge module. It requires not only the development of trustworthy merge methodologies, but also substantial platform effort to identify the metric type and level, output the correct intermediate statistics, and apply the corresponding merging algorithm at scale.
Validation of Scorecard Merging Solution
Validation is essential for establishing the merged scorecard’s reliability within a large-scale experimentation platform like ExP. This process confirms whether the merged scorecard matches the global scorecard, highlighting any discrepancies that require attention. ExP conducted both internal and external validation.
Internal validation
The internal validation tests the reliability of merging regional scorecards into a global scorecard. Using pseudo-regional data (global data partitioned post-hoc into regional data sources), we validate that the merged scorecard is identical to the global one, with key values such as metric values, standard deviations, Z-statistics, and p-values falling within computational error tolerance. Successful internal validation assures that the statistical and engineering machinery of the merging process is functioning as intended.
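As an illustration, such a check could look like the sketch below, which reuses the hypothetical merge_average_metric from above and partitions a global dataset into pseudo-regions by hashing the RU identifier (the partitioning choice and tolerance are ours):

```python
import statistics

def internal_validation(observations, region_of, tol=1e-9):
    """Check that merging pseudo-regional statistics reproduces the
    global statistics within computational error tolerance.

    observations: dict of RU id -> metric value (one global dataset)
    region_of:    function mapping an RU id to a pseudo-region label
    """
    # Global statistics computed directly on the full dataset.
    values = list(observations.values())
    global_stats = (len(values), statistics.fmean(values), statistics.stdev(values))

    # Partition post-hoc into pseudo-regional datasets, then merge.
    regions = {}
    for ru, y in observations.items():
        regions.setdefault(region_of(ru), []).append(y)
    parts = [(len(v), statistics.fmean(v),
              statistics.stdev(v) if len(v) > 1 else 0.0)
             for v in regions.values()]
    merged_stats = merge_average_metric(parts)

    assert all(abs(g - m) <= tol for g, m in zip(global_stats, merged_stats)), \
        (global_stats, merged_stats)

# Example with synthetic data and three pseudo-regions.
data = {f"user-{i}": (i % 7) / 10 for i in range(1, 10_001)}
internal_validation(data, region_of=lambda ru: hash(ru) % 3)
```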
Internal validation ensured that the merge logic was accurate on cleanly partitioned data. In practice, data routing logic can be more complex or lossy, introducing additional potential for discrepancies between the true global treatment effect and the one obtained by merging regional results. Assessing the accuracy of scorecard merging in situ required external validation.
External validation
ExP also performs external validation with product teams, which involves several steps and checkpoints (see Figure 2) and is more challenging due to real-world complexities, such as users traveling across regions. Here, the focus is on confirming that merged metric values align with global metric values, with discrepancies falling within acceptable limits, rather than requiring the exact equality of internal validation. External validation includes guardrail checks (focusing on consistency in shipping decisions) and sensitivity checks (examining key statistics like Z-statistics and p-values for significant deviations), as illustrated in Figure 3. To support this process, prior to boundary enforcement, telemetry data was temporarily double-pumped: EU and RoW telemetry data were routed to both regional and global storage, allowing the creation of both merged and global scorecards.
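In code, guardrail and sensitivity checks might be sketched as below. The scorecard representation (a metric-to-statistics mapping) and the thresholds are hypothetical, not ExP’s actual interfaces:

```python
def decision(effect, p_value, alpha=0.05):
    """Collapse a metric result into a coarse shipping signal."""
    if p_value >= alpha:
        return "flat"
    return "positive" if effect > 0 else "negative"

def guardrail_check(merged, global_, alpha=0.05):
    """Flag metrics where the merged and global scorecards would imply
    different shipping decisions. Scorecards: metric -> (effect, p_value)."""
    return [m for m in merged
            if decision(*merged[m], alpha=alpha) != decision(*global_[m], alpha=alpha)]

def sensitivity_check(merged_z, global_z, max_z_gap=0.25):
    """Flag metrics whose Z-statistics deviate by more than a chosen
    threshold. Inputs here: metric -> z_statistic."""
    return [m for m in merged_z if abs(merged_z[m] - global_z[m]) > max_z_gap]
```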
Validation helped product teams and ExP identify multiple issues, for example:
- Incorrect data splitting logic: Identifying data loss or overlap by comparing user counts between merged and global scorecards. Additionally, a data center tagging issue was found by analyzing regional scorecards against globally filtered scorecards for specific regions.
- Pipeline inconsistency: Ensuring regional data pipelines align with the global pipeline. By investigating an over-5% difference in user count between a regional scorecard and the global scorecard filtered to the same region, teams identified a discrepancy in raw data processing between the existing global pipeline and the new EU data pipeline.
- Merging algorithm edge case: Addressing issues like numerical precision errors, which were corrected after finding differences in extreme scorecard values.
Conclusions
In summary, ExP’s merged scorecard solution advances data analysis and decision-making by adhering to key principles:
- Full coverage. ExP supports a wide range of metric types (including average and percentile metrics at and below the randomization-unit level) and compute fabrics (e.g., Cosmos and Spark), matching pre-EUDB scenario coverage and accommodating varying EUDB compliance deadlines across different teams.
- Scalability and generalizability. As of October 2024, at least 17 Microsoft products have onboarded to the ExP scorecard merging solution. In September 2024 alone, ExP generated over 37,000 merged scorecards for more than 2,800 tested new features, with a job success rate of over 99.8%. Additionally, the merged scorecard solution resolves issues with large analysis jobs that previously failed in the global pipeline due to compute limits: large datasets are split into several smaller, disjoint sets, processed separately into smaller scorecards, and then merged into a complete global scorecard. This demonstrates the scalability, generalizability, and reliability of the solution.
- Consistency. This solution preserves a seamless user experience by consolidating regional scorecards into a single, global view, eliminating the need to manually review multiple regional scorecards. The user experience for experiment authoring and analysis remains consistent with pre-EUDB practices, ensuring a smooth transition for product teams without additional cognitive load.
Furthermore, by integrating regional data into one unified scorecard, teams avoid the risk of making inaccurate decisions based on incomplete data. In a recent example from a Microsoft product, 14.7% of over 4,000 regional scorecards showed different scorecard-level treatment effects compared to their merged global counterparts. This solution not only improves accuracy and efficiency but also strengthens Microsoft’s ability to make valid data-driven decisions while remaining compliant with evolving regulatory requirements for data handling.
— Ada Wang and Hao Ai, Microsoft Experimentation Platform
References
[1] Deng, A., Knoblich, U., & Lu, J. (2018). Applying the Delta method in metric analytics: A practical guide with novel ideas (Section 4). In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 233-242).