Anderson Santana De Oliveira

Differential privacy can improve demographic parity and equalized odds

Optimizing the model architecture for DP-SGD decreases disparities across groups

In collaboration with: Anderson Santana De Oliveira, Caelin Kaplan, Khawla Mallat, Volkmar Lotz, and Tanmay Chakraborty

The responsible design and use of AI technology has received increased attention from society and business. Hence, Trustworthy AI is high on the priority list of companies that provide AI services, platforms, or applications with embedded AI. At SAP Security Research, we focus on the privacy and fairness dimensions of trustworthy AI: we investigate how technology can better support the design of AI systems by mitigating biases and by preventing the unintentional disclosure of information about the training data.

Photo by Eak K. from Pixabay

With these goals in mind, we started with a basic question: how are fairness metrics affected by differential privacy? It is known that training a model with differential privacy constraints has a disparate impact on accuracy, which results in underrepresented groups being more adversely affected [1]. However, what happens when you consider a given fairness notion such as demographic parity or equalized odds? Below we will see that the effect of differential privacy is beneficial most of the time.

Recent work has demonstrated that, when training a machine learning model with differentially private stochastic gradient descent (DP-SGD), it is necessary to select an optimal model architecture rather than simply adding DP-SGD to the best identified non-private model [2]. Using datasets typically examined in ML fairness research, we show that when the model architecture is selected specifically for DP-SGD, either the fairness metrics are negligibly impacted or the differences across groups decrease compared to the non-private baseline. Additionally, the AUC score on the test set remains close between the best private and non-private models. These positive effects are not observed when one trains the best non-private baseline architecture with DP-SGD, suggesting that the hyperparameter search needs to be carried out specifically for the differentially private setting.

The tables below show our results on test data for four distinct datasets with the following metrics: overall AUC (for all groups of individuals), maximum AUC difference across groups, maximum demographic parity difference across groups, and maximum equalized odds difference across groups. Here we used a differential privacy budget (epsilon) equal to 5.0. We set the decision threshold to 0.5; this does not affect the AUC scores, but it does influence the demographic parity and equalized odds difference scores.

Overall AUC and maximum AUC difference across groups for the best model configurations with and without differential privacy. Values are means and standard deviations over 10 model training runs; the difference metric reports the largest difference across groups for each dataset. Image by the authors.

Mean and standard deviation values over 10 training runs for the best model configurations with and without differential privacy. Each metric reports the largest difference across groups for each dataset. Image by the authors.
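To illustrate why the 0.5 decision threshold matters for the fairness metrics but not for the AUC, here is a minimal pure-Python sketch. The labels and scores are hypothetical toy values, and `auc` and `selection_rate` are illustrative helper names rather than functions used in the experiments.

```python
def auc(y_true, scores):
    """Probability that a random positive is scored above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Count a win for each correctly ordered pair, half a win for ties.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def selection_rate(scores, threshold):
    """Fraction of records predicted positive at the given threshold."""
    return sum(1 for s in scores if s >= threshold) / len(scores)


# Hypothetical toy data, not taken from any of the datasets.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc_value = auc(y_true, scores)            # uses only the ranking of the scores
rate_at_05 = selection_rate(scores, 0.5)   # depends on the threshold
rate_at_03 = selection_rate(scores, 0.3)
print(auc_value, rate_at_05, rate_at_03)   # 0.75 0.25 0.75
```

The AUC is computed from the ordering of the scores alone, whereas the selection rate (and therefore any metric built on hard predictions) changes as soon as the threshold moves.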

These conclusions are drawn from a set of rigorous experiments that included an exhaustive grid search for the best hyperparameters for the private and non-private models (e.g., activation function, optimizer, number of hidden layers, dropout probability). We measure the key group fairness metrics for each dataset to evaluate whether the trained models satisfy the equalized odds and demographic parity fairness notions.

Datasets

  • Adult: The UCI Adult dataset [3] contains US census income survey records. We use the binarized “income” feature as the target variable for our classification task to predict if an individual’s income is above 50k.
  • LSAC: The Law School dataset [4] comes from the Law School Admission Council’s National Longitudinal Bar Passage Study and is used to predict whether a candidate will pass the bar exam. It consists of law school admission records, including gender, race, family income, and features derived from the GPA. The classification target is the binary feature “isPassBar”.
  • ACS Employment: The ACS employment dataset [6] is derived from the American Community Survey released by the US Census Bureau, where the feature indicating whether a given individual was employed in the year of the data collection is transformed into the prediction target. Here, we use 10% of the survey results for the state of California in 2018. It includes features such as educational attainment, marital status, citizenship, and employment status of parents, among others.
  • ACS Income: The ACS income dataset [6] was created as a replacement for the UCI Adult dataset. The task is to predict if a person’s income is above 50 thousand dollars. As with the ACS Employment dataset, we use a 10% sample of data from the ACS survey results for the state of California in 2018.

In the experimental results we show here, each subgroup is defined as an intersection of gender and race.
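As a minimal sketch of how such intersectional subgroups can be built, the snippet below combines two attribute columns into one group label per record. The records are hypothetical toy values, and the helper name `intersectional_groups` and the `"gender|race"` label format are assumptions made for illustration.

```python
from collections import Counter


def intersectional_groups(genders, races):
    """Label each record with its (gender, race) intersection."""
    return [f"{g}|{r}" for g, r in zip(genders, races)]


# Hypothetical toy records, not taken from any of the datasets.
genders = ["Female", "Male", "Female", "Male"]
races = ["Black", "White", "White", "White"]

groups = intersectional_groups(genders, races)
group_sizes = Counter(groups)  # useful to spot poorly represented subgroups
print(group_sizes)
```

Counting the records per label is a quick way to see which intersections are sparsely populated before computing per-group metrics.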

Fairness Notions

We briefly define the fairness notions we used in this work.

Demographic Parity

A predictor satisfies the demographic parity constraint if each subgroup receives the positive outcome at an equal rate.

Mathematically speaking, demographic parity can be expressed as follows, where Y′ is the predicted label and A is the protected attribute indicating the subgroup:

P(Y′ = 1 | A = a) = P(Y′ = 1 | A = b), ∀ a, b ∈ A

The demographic parity notion helps prevent the reinforcement of historical biases and supports underprivileged groups in the short term through the progressive reinforcement of a positive feedback loop. However, ensuring demographic parity focuses only on the final outcome and not on the equality of treatment. This can lead to a problem called laziness, where a trained model selects the true positives of the privileged groups while selecting subjects from the underprivileged groups at random (with a coin toss), as long as the number of selected subjects from each subgroup satisfies the constraint. In addition, demographic parity can be applied in an inappropriate context, where a disparity is truly present but is not related to a protected attribute.
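The demographic parity difference reported in this article can be sketched in a few lines of pure Python: compute the selection rate of each subgroup and take the gap between the largest and the smallest. The data below are hypothetical toy values and the function names are illustrative; Fairlearn [8] provides a comparable `demographic_parity_difference` metric, which this sketch does not call.

```python
def selection_rates(y_pred, groups):
    """Estimate P(Y' = 1 | A = g) for every subgroup g."""
    totals, positives = {}, {}
    for pred, g in zip(y_pred, groups):
        totals[g] = totals.get(g, 0) + 1
        positives[g] = positives.get(g, 0) + pred
    return {g: positives[g] / totals[g] for g in totals}


def demographic_parity_difference(y_pred, groups):
    """Largest gap in selection rate between any two subgroups."""
    rates = selection_rates(y_pred, groups)
    return max(rates.values()) - min(rates.values())


# Hypothetical toy predictions for two subgroups "a" and "b".
y_pred = [1, 1, 0, 1, 0, 0, 0, 1]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

rates = selection_rates(y_pred, groups)
dp_diff = demographic_parity_difference(y_pred, groups)
print(rates, dp_diff)  # {'a': 0.75, 'b': 0.25} 0.5
```

A value of 0 means that all subgroups are selected at the same rate; the ground truth labels play no role in this metric.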

Equalized Odds

A predictor that satisfies the equalized odds constraint should predict each of the labels equally well for all the subgroups.

Equalized odds can be expressed as follows, where A is the protected attribute characterizing the subgroup, Y′ is the predicted label and Y is the ground truth label:

P(Y′ = 1 | Y = y, A = a) = P(Y′ = 1 | Y = y, A = b), ∀ y ∈ {0,1} and ∀ a, b ∈ A

Unlike demographic parity, equalized odds ensures equality of treatment, which eliminates the laziness issue mentioned above. However, the equalized odds notion does not help in dealing with bias in the long term. It does not take into consideration potential bias outside of the model, which can end up reinforcing that bias, particularly in cases where an extreme disproportion is present between different subgroups.
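As a hedged illustration, the equalized odds difference can be sketched by comparing, across subgroups, the rate of positive predictions among true negatives (false positive rate) and among true positives (true positive rate). Taking the larger of the two gaps is one common convention and is an assumption here. The data are hypothetical toy values, and the sketch assumes every subgroup contains both labels.

```python
def positive_rate(y_true, y_pred, groups, group, label):
    """Estimate P(Y' = 1 | Y = label, A = group)."""
    preds = [p for t, p, g in zip(y_true, y_pred, groups) if g == group and t == label]
    return sum(preds) / len(preds)  # assumes the subgroup contains this label


def equalized_odds_difference(y_true, y_pred, groups):
    """Larger of the across-group gaps in FPR (label 0) and TPR (label 1)."""
    names = sorted(set(groups))
    gaps = []
    for label in (0, 1):
        rates = [positive_rate(y_true, y_pred, groups, g, label) for g in names]
        gaps.append(max(rates) - min(rates))
    return max(gaps)


# Hypothetical toy data for two subgroups "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

eo_diff = equalized_odds_difference(y_true, y_pred, groups)
print(eo_diff)  # 0.5
```

In this toy example both subgroups have a selection rate of 0.5, so demographic parity holds, yet the predictions for subgroup "b" are much less accurate. Equalized odds detects exactly this kind of gap.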

The Impossibility Theorem of Fairness [9] states that demographic parity and equalized odds generally cannot be satisfied at the same time, except in degenerate cases such as equal base rates across subgroups.

Experiments

We defined a basic neural network architecture for tabular data in PyTorch. We ran a 5-fold cross-validation hyperparameter search on each dataset to find the best hyperparameter configuration, with the goal of maximizing the overall model AUC score. This metric is appropriate because it is insensitive to class imbalance and to the decision threshold. We performed the hyperparameter search four times for each of the datasets:

  • With and without the protected attributes (gender and race; note that in this study we kept all the original race categories present in the dataset);
  • With and without differential privacy (we applied DP-SGD using Opacus [7]).
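For readers unfamiliar with DP-SGD, the following pure-Python sketch shows the core of a single gradient computation: each per-sample gradient is clipped to a maximum norm, the clipped gradients are summed, Gaussian noise scaled to that norm is added, and the result is averaged. This is a simplified illustration, not Opacus's implementation; the function name, the toy gradients, and the parameter values are assumptions.

```python
import math
import random


def dp_sgd_gradient(per_sample_grads, max_grad_norm, noise_multiplier, rng):
    """Clip each per-sample gradient, sum, add Gaussian noise, and average."""
    dim = len(per_sample_grads[0])
    total = [0.0] * dim
    for grad in per_sample_grads:
        norm = math.sqrt(sum(x * x for x in grad))
        # Scale the gradient down only if its norm exceeds the bound.
        scale = min(1.0, max_grad_norm / norm) if norm > 0 else 1.0
        for i, x in enumerate(grad):
            total[i] += x * scale
    # The noise standard deviation is proportional to the clipping bound.
    std = noise_multiplier * max_grad_norm
    noisy = [t + rng.gauss(0.0, std) for t in total]
    return [v / len(per_sample_grads) for v in noisy]


# Hypothetical per-sample gradients: the first has norm 5.0, the second 0.5.
grads = [[3.0, 4.0], [0.3, 0.4]]

clipped_only = dp_sgd_gradient(grads, 1.0, 0.0, random.Random(0))
with_noise = dp_sgd_gradient(grads, 1.0, 1.0, random.Random(0))
print(clipped_only, with_noise)
```

With the noise multiplier set to 0, only the clipping is visible: the large gradient is scaled down to norm 1 while the small one is untouched. A smaller maximum norm limits the influence of any single record but also discards more gradient signal, which is why this value is worth tuning.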

The grid search was performed over the following configuration ranges:

  • Learning rate: 1e-4, 1e-3, 1e-2, 1e-1
  • Dropout probability: 0.0, 0.1, 0.2
  • Number of hidden blocks: 1 to 3
  • Batch size: 256, 512
  • Activation: ReLU, Tanh
  • Optimizer: Adam, SGD
  • Max gradient norm (for differential privacy only): 0.01, 0.1, 1
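The grid above can be enumerated with `itertools.product`, as in the sketch below. The dictionary keys and the `expand` helper are illustrative names, not taken from the experiment code.

```python
from itertools import product

# Search space shared by the private and non-private searches.
base_grid = {
    "learning_rate": [1e-4, 1e-3, 1e-2, 1e-1],
    "dropout": [0.0, 0.1, 0.2],
    "hidden_blocks": [1, 2, 3],
    "batch_size": [256, 512],
    "activation": ["relu", "tanh"],
    "optimizer": ["adam", "sgd"],
}
# The clipping bound is only searched when training with differential privacy.
private_grid = {**base_grid, "max_grad_norm": [0.01, 0.1, 1.0]}


def expand(grid):
    """Return one dict per combination of hyperparameter values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*grid.values())]


non_private_configs = expand(base_grid)
private_configs = expand(private_grid)
print(len(non_private_configs), len(private_configs))  # 288 864
```

Assuming every combination is evaluated, each non-private search covers 288 configurations and each private search 864, before multiplying by the five cross-validation folds.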

Afterwards, we trained the model ten times using the best configuration for each of the four cases (with/without privacy, with/without the protected attributes). We used a hold-out test set (not used during the 5-fold cross-validation phase) to collect all the metric values, which were computed with Fairlearn [8].
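Aggregating a metric over the ten runs amounts to a mean and a standard deviation, as in this small sketch. The values are hypothetical placeholders, not results from the experiments, and using the sample standard deviation (`statistics.stdev`) rather than the population one is an assumption.

```python
from statistics import mean, stdev


def summarize(values):
    """Return the mean and sample standard deviation of a metric over runs."""
    return mean(values), stdev(values)


# Hypothetical metric values for ten training runs (placeholders only).
run_values = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.12, 0.11, 0.10, 0.12]

run_mean, run_std = summarize(run_values)
print(f"{run_mean:.3f} +/- {run_std:.3f}")
```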

We also trained the best non-private model configuration with the addition of differential privacy. In this case, we did not adjust any other hyperparameters. Using the default model with differential privacy approximates what has occurred in most previous research, where differential privacy was added to the baseline models with minor adjustments (perhaps batch size or learning rate).

In the results that follow, we omit the case where the protected attributes were not used, to keep this article concise. Overall, most metrics are better when the protected attributes are present in the training data.

Results

Below we discuss the detailed results for each of the datasets.

ACS Income

In the plots below we present several metrics for the ACS Income dataset. The blue boxes represent the best model configuration with differential privacy. The orange boxes represent the baseline model, which is the best model configuration without differential privacy. The green boxes use the same configuration as the baseline model, but trained with differential privacy. We can observe that differential privacy considerably decreases the demographic parity difference, the equalized odds difference, and the AUC difference, while maintaining an overall AUC very close to that of the baseline model. Simply adding differential privacy to the baseline model gives poor results overall.

Several metrics for the ACS Income dataset. Image by the authors.

The plots below display the mean values and standard deviations over 10 model training runs for accuracy, false positive and false negative rates, AUC score, and selection rate for each group combination among the race and gender attributes. While we can observe differences between the private and non-private versions of the models, differential privacy does not seem to consistently penalize a given minority group.
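The per-group quantities shown in these plots (accuracy, false positive rate, false negative rate, and selection rate) can be derived from each group's confusion counts. The sketch below uses hypothetical toy labels and an illustrative helper name; it is not the code used to produce the plots.

```python
def group_metrics(y_true, y_pred, groups):
    """Accuracy, FPR, FNR and selection rate for every subgroup."""
    out = {}
    for g in sorted(set(groups)):
        pairs = [(t, p) for t, p, a in zip(y_true, y_pred, groups) if a == g]
        tp = sum(1 for t, p in pairs if t == 1 and p == 1)
        tn = sum(1 for t, p in pairs if t == 0 and p == 0)
        fp = sum(1 for t, p in pairs if t == 0 and p == 1)
        fn = sum(1 for t, p in pairs if t == 1 and p == 0)
        out[g] = {
            "accuracy": (tp + tn) / len(pairs),
            # Rates are undefined when a group has no negatives or positives.
            "fpr": fp / (fp + tn) if fp + tn else float("nan"),
            "fnr": fn / (fn + tp) if fn + tp else float("nan"),
            "selection_rate": (tp + fp) / len(pairs),
        }
    return out


# Hypothetical toy data for two groups, labelled G3 and G8 for illustration.
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 0, 0, 1, 1, 1, 0, 0]
groups = ["G3", "G3", "G3", "G3", "G8", "G8", "G8", "G8"]

metrics = group_metrics(y_true, y_pred, groups)
print(metrics)
```

In this toy example both groups have the same selection rate but different error rates, which is why the plots report several metrics side by side.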

The per-group performance plots below contain the following groups in this order (listed here because the labels are unreadable in the plot):

  • G1: American Indian
  • G2: Amerindian Tribes
  • G3: Asian
  • G4: Black
  • G5: Native Hawaiian
  • G6: Some Other Race
  • G7: Two or More Races
  • G8: White

Several performance metrics by group of individuals in the ACS Income dataset. Image by the authors.

LSAC

Below we can see the fairness metric results for the Law School Admissions dataset. This is a more standard fairness problem, since the majority groups are white males and white females.

Fairness metrics for LSAC. Image by the authors.

Performance indicators by each subgroup of individuals in LSAC. Image by the authors.

We can see here a more pronounced drop in performance (accuracy and AUC) when using differential privacy. Note, however, that the larger accuracy drop for Black males is explained by a higher false positive rate which, combined with a lower false negative rate for this minority group, necessarily raises its selection rate, thus improving fairness overall.
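The link between error rates and selection rate follows from a simple identity: P(Y′ = 1) = P(Y = 1) · (1 − FNR) + P(Y = 0) · FPR. The sketch below evaluates it for hypothetical numbers (not values from the LSAC experiments) to show that a higher false positive rate together with a lower false negative rate can only increase the selection rate of a group.

```python
def selection_rate(base_rate, fpr, fnr):
    """P(Y'=1) = P(Y=1) * (1 - FNR) + P(Y=0) * FPR for one subgroup."""
    return base_rate * (1.0 - fnr) + (1.0 - base_rate) * fpr


# Hypothetical values for one subgroup (illustration only).
base_rate = 0.8                                       # share of true positives
before = selection_rate(base_rate, fpr=0.3, fnr=0.2)  # e.g. non-private model
after = selection_rate(base_rate, fpr=0.5, fnr=0.1)   # higher FPR, lower FNR
print(round(before, 2), round(after, 2))  # 0.7 0.82
```

Both terms of the identity move in the same direction, so the selection rate rises regardless of the group's base rate.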

Adult

This is a controversial but widely studied dataset for fairness research — it has been criticized, for instance, because the 50k income threshold makes the class imbalance too high. We report our results below. We decided to maintain all the racial groups that appear in the original dataset. This explains the disparities we can observe in the metric values because some groups are poorly represented. Still, differential privacy helped to decrease the differences across groups.

Fairness metrics for the Adult dataset. Image by the authors.

Performance metrics by group in Adult. Image by the authors.

We note that, for this dataset, the drop in performance metrics for the differentially private model remains small across groups.

ACS Employment

In this dataset, differential privacy’s beneficial effects are easily visible.

Fairness metrics results for the ACS Employment dataset. Image by the authors.

The same groups as for ACS Income appear in the per-group performance metrics below:

Performance metrics by subgroups for the ACS Employment dataset. Image by the authors.

For certain metrics and groups, we can see considerable variations in FPR and FNR in the baseline model. The noise introduced by differential privacy seems to have a balancing effect, reducing the variation of certain metrics for a given subgroup.

Discussion

Differential privacy certainly reduces the model utility. We can observe slight decreases in the overall AUC scores. On the other hand, differential privacy brings considerable changes to the False Positive and Negative rates. This softens the disparities across groups for all datasets. The demographic parity differences were drastically reduced in all datasets, while equalized odds differences either remained almost equal (for LSAC) or were reduced (a 4% reduction for Adult, 19% reduction for ACS Employment, and 55% reduction for ACS Income).

Previous research has emphasized that differential privacy adversely impacts accuracy for underrepresented groups. While this may be the case when adding differential privacy to an existing non-private model configuration, a model optimized for privacy can often be more “fair” than its non-privacy-preserving counterpart.

References

[1] Bagdasaryan, E., Poursaeed, O., & Shmatikov, V. (2019). Differential privacy has disparate impact on model accuracy. Advances in Neural Information Processing Systems, 32.

[2] Tramer, F., & Boneh, D. (2020). Differentially private learning needs better features (or much more data). arXiv preprint arXiv:2011.11660.

[3] B. Becker and R. Kohavi. 1996. UCI ML Repository. http://archive.ics.uci.edu/ml

[4] L. F Wightman. 1998. LSAC National Longitudinal Bar Passage Study. LSAC Research Report Series. (1998).

[5] J. Angwin, J. Larson, S. Mattu, and L. Kirchner. 2016. Machine bias: There’s software used across the country to predict future criminals and it’s biased against blacks. ProPublica (2016).

[6] Ding, F., Hardt, M., Miller, J., & Schmidt, L. (2021). Retiring adult: New datasets for fair machine learning. Advances in Neural Information Processing Systems, 34.

[7] https://opacus.ai

[8] https://fairlearn.org/

[9] Friedler, Sorelle A., Carlos Scheidegger, and Suresh Venkatasubramanian. “The (im) possibility of fairness: Different value systems require different mechanisms for fair decision making.” Communications of the ACM 64.4 (2021): 136–143.
