Is use of the general system justification scale across countries justified? Testing its measurement equivalence

System justification is a widely researched topic in social and political psychology. One major measurement instrument in system justification research is the General System Justification Scale (G-SJS). This scale has been used, among other purposes, for comparisons across social groups in different countries. Such comparisons rely on the assumption that the scale is measurement equivalent. However, this assumption has never been comprehensively tested. Thus, the present two studies assessed the measurement equivalence of the G-SJS following classic measurement equivalence guidelines (i.e., multigroup confirmatory factor analyses) in Study 1 and using a new method for comparing larger numbers of groups in Study 2 (i.e., alignment optimization). In Study 1, we analysed the measurement equivalence in Great Britain (n = 444), Germany (n = 454), and France (n = 463). In Study 2, we used a publicly available dataset consisting of 66 samples from 30 countries (N = 13,495) to again assess the measurement equivalence of the scale. Results indicated (partial) metric equivalence, but not scalar equivalence in both studies. Overall, the studies indicate that mean comparisons across the examined countries are not warranted with the current form of the scale.


Theoretical background
What is system justification?
System justification can be defined as the motivation to bolster and defend the societal status quo (Jost et al., 2017). According to system justification theory, people are motivated to defend and justify the existing economic, social, and political system. For instance, the social system can be justified by evaluating high-status groups as deserving of their positions rather than admitting that the current social order may be unjust. One important aspect of the theory is that it predicts system-justifying motivations not only among those who profit from the societal status quo, mainly high-status persons, but also among those who are disadvantaged by the system. Socially disadvantaged people can justify a system that provides them with little support, meaning that their electoral decisions, therefore, do not necessarily favor parties that would change the system to their advantage; they engage in such justifications to feel better about existing inequalities and power asymmetries (van der Toorn et al., 2011).
After initially focusing on stereotyping and prejudice, system justification theory research expanded to account for a wide range of outcomes (Jost, 2019). For example, system justification theory has been used to explain differences in political ideologies (Jost et al., 2017), justification of downsizing (Richter & König, 2017), and appraisals of pay entitlement (O'Brien et al., 2012). System justification has also been found to be an important contributor to resistance to change (Jost et al., 2017). Additionally, several factors have been identified that predispose individuals to justify the system, including a strong national identification (Carter et al., 2011) and perceived powerlessness (van der Toorn et al., 2015).

KEYWORDS
measurement equivalence, measurement invariance, multi-country comparison, system justification

Cross-country comparisons and measurement equivalence of the G-SJS
System justification theory assumes that all people are generally motivated to justify the system to a certain extent, but the level of this motivation varies both between and within people (Jost, 2019). The G-SJS (Kay & Jost, 2003) is intended to measure this variability between people in system-justifying tendencies. Although this scale has been widely used as a measure of system justification motivation, it has not been without theoretical critique: Some researchers interpret it as a measure of status quo perception rather than of a motivation (Owuamalam et al., 2018). Furthermore, some items of the scale have been argued to measure constructs other than system justification. In particular, the item "[country name] is the best country to live in" might assess national attachment rather than system justification (Owuamalam et al., 2019). National attachment is, however, theoretically different from system justification and also varies between countries (Becker et al., 2017).
Despite these criticisms, the G-SJS has been used to examine system justification in different social groups and countries around the world (e.g., Brandt et al., 2020; Vargas-Salfate et al., 2018; Zmigrod, 2020). Although this cross-country research represents only a fraction of the research on system justification, such research is critical for system justification theory because the theory seeks to describe general processes that are broadly applicable to people in different societies with different political systems. For example, Cichocka and Jost (2014) compared system justification scores between capitalist and post-Communist countries. Another study found that system justification was positively related to life satisfaction and negatively related to depression across 18 countries (Vargas-Salfate et al., 2018). Furthermore, Brandt et al. (2020) assessed participants' social status and perceived legitimacy of the social system across 30 countries and 66 samples. They found that people with higher status assessed the social system as more legitimate than those with lower status. At the same time, they also reported considerable variation across people and countries, emphasising the relevance of ensuring that the measurement instruments used capture the same construct across individuals and countries.
A prerequisite for cross-country comparisons is that the measurement instruments used assess the same construct in the same way in all countries (i.e., the instruments are measurement equivalent or measurement invariant, with both terms used synonymously; Vandenberg & Lance, 2000). This implies that participants from different countries with the same level of the latent construct of interest should provide the same manifest response on the measurement instrument (Hirschfeld & von Brachel, 2014). If measurement equivalence (ME) has not been established, it is unclear whether and to what extent similar responses indicate similar levels of the latent construct. In this case, results from different countries cannot be easily compared.
The need to establish ME is often overlooked in cross-country and cross-group research (Flake et al., 2017), including research using the G-SJS: System justification motivation and its relations with other constructs might differ across countries or other groups either because of true differences in this form of motivation or because respondents in different countries or groups understand the scale items differently. Such differences in understanding could be caused by incorrect translations or by different interpretations in different political, economic, and social contexts.
A lack of ME can distort statistical conclusions and practical implications based on analyses of observed mean differences between groups (N. Schmitt & Ali, 2014). Hence, manifest differences between group samples might be misattributed to real group differences, while in fact they merely reflect different interpretations or understandings of the scale items. Conversely, manifest non-differences between group samples might be falsely attributed to a lack of real differences between groups. Hence, ensuring ME of the G-SJS across countries is important to provide cross-cultural validity to research using this scale (see Osborne et al., 2019, for a discussion of cross-cultural generalisability of the system justification theory).
There are different types of ME: configural, metric, and scalar equivalence (see supplemental online material [SOM] S1 for further explanations; Vandenberg & Lance, 2000). Scalar equivalence is a prerequisite for comparing mean values across groups and for multilevel analyses because it implies that the scales have the same operational definition across groups (Cheung & Rensvold, 2002). Hence, scalar equivalence would have been needed to compare the means of system justification across countries, as done by Vargas-Salfate et al. (2018) in their preliminary analyses, or to conduct multilevel analyses, as in the study by Brandt et al. (2020). However, Vargas-Salfate et al. (2018) themselves reported that the short versions of the G-SJS they used were not invariant; hence, their comparisons of mean values should be interpreted with caution.
Furthermore, the practical importance and magnitude of non-equivalence should also be considered. Statistical evidence for measurement non-equivalence does not necessarily imply severe distortions in the interpretation of measurement scores from different groups, particularly in large samples. Thus, researchers have been advised to calculate an effect size known as "dMACS" (with MACS = mean and covariance structure) on the item level and, as an unstandardised effect size, impact on the scale level (Nye & Drasgow, 2011). The main aim of the present study was to examine the measurement equivalence of the G-SJS and to quantify the magnitude of non-equivalence, if any (RQ1).
In addition to its main aim, that is, testing ME, the present study also examined aspects of the G-SJS's convergent validity (Kay & Jost, 2003).Ensuring convergent validity is important to verify the functioning of the scale across countries.Not only should participants in different countries understand and interpret the scale items in the same way (ME) but the scale should also exhibit similar relationships with other relevant constructs across countries (convergent validity).
Two constructs well-suited to test for the G-SJS's convergent validity are political orientation and willingness to strike. System-justifying outcomes are often interpreted as indicators of conservatism (Jost et al., 2017) as the needs underlying system justification are also at the root of political conservatism (Jost et al., 2003), with conservatism defined as containing the two interrelated aspects of resisting social change and accepting inequality (Jost et al., 2003, 2009). Jost et al. (2001) concluded that motivation to defend the existing social system is weaker among supporters of left-wing ideology than of right-wing ideology. Conservatism can, thus, be called a "prototypical system-justifying ideology" (Jost et al., 2003, p. 63) because the main components of conservative belief systems focus on acceptance of inequality and opposition to change (Huntington, 1957). Jost (2019) inferred that system justification is almost always positively related to the endorsement of politically conservative ideologies. This relation has been shown for samples from Great Britain (Zmigrod et al., 2018) and Germany (Jost, 2019). France is the only country in which a negative correlation between system justification and conservatism has been found (Langer et al., 2020): System justification was associated with liberal-socialist attitudes and with liberal/leftist preferences regarding immigration and welfare, contradicting results from other countries and theoretical arguments. Langer et al. (2020) attributed the results from France to the Enlightenment ideals of 'liberté, égalité, and fraternité' stemming from the French Revolution, which are still deeply entrenched in France, up to the point that they might represent the societal status quo. Although this might be a valid explanation for their results, varying correlation patterns with key variables across countries could be problematic for the international comparability of the scale. Thus, the question of whether the relation between system justification and political orientation is similar across countries is critical for future research.
Beyond its relation with political orientation, system justification can have many consequences for political behavior, including participation in collective action (Jost et al., 2017). For example, willingness to protest for change can be undermined by system justification as it fosters resistance to change (Jost et al., 2012). Hence, system-justifying beliefs should be negatively associated with support for protest and willingness to protest. One form of collective action is people's willingness to strike to collectively enforce jointly held economic or other goals against employers. We, therefore, used willingness to strike as a second criterion for the G-SJS's convergent validity in Study 1. Willingness to strike can be a generalisation of dissatisfaction with one part of one's work to other parts and a function of dissatisfaction in many areas (Stagner, 1956).
In Study 1, we had the opportunity to examine the convergent validity of the G-SJS, in that we assessed whether the associations between system justification and political orientation reported by Jost (2019) and Langer et al. (2020) could be replicated in different samples (RQ2a) and whether system justification is negatively correlated with willingness to strike in Great Britain, Germany, and France (RQ2b).
In summary, the objective of this research was to examine the ME of the G-SJS. In Study 1, we drew on a sample of three countries: Great Britain, Germany, and France. In Study 2, we extended the scope of our analyses to 30 countries by using the dataset from the study by Brandt et al. (2020). Additional analyses examined convergent validity of the G-SJS in Study 1.

STUDY 1 - METHODS
The data were collected in the context of a larger preregistered study pursuing a different research question (https://aspredicted.org/blind.php?x=tx4q7x). Corresponding results can be found in a different article (Vesper & König, in preparation). A list of all collected variables and the data can be found at https://osf.io/34p9j/?view_only=5e248dda2c18460289bf75a71d430c5d.

Sample
Participants were recruited through an online panel provider in Great Britain, Germany, and France and received 0.50 €/0.50 £ as an incentive. A total of 1652 persons completed the study. Since these data were primarily collected for another study concerning attitudes toward strikes, people who were not currently employed were screened out (n = 92). Participants who did not permit their data to be used for scientific purposes were also screened out (n = 33, following Meade & Craig, 2012). Several steps were taken to ensure data quality (Meade & Craig, 2012): First, to deal with overly swift completion, all participants were excluded who completed the items faster than a rate of two seconds per item (n = 78, Huang et al., 2012). Second, if participants chose the same response option for more than six successive items, their data were excluded (n = 88, Niessen et al., 2016). This exclusion procedure was pre-registered. After this procedure, N = 1361 people remained in the final sample.
In the final sample, the mean age was 46.33 (SD = 10.03) and 66.9% of the participants were female. In the British sample (n = 444), the mean age was 46.82 (SD = 10.68) and 65.8% were female. In the German sample (n = 454), the mean age was 44.80 (SD = 10.64) and 65.4% of the participants were female. In the sample from France (n = 463), the mean age was 47.36 (SD = 8.53) and 69.3% were female.
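The two data-quality rules and the sample flow described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' actual screening script: the function names, the argument layout (a total completion time in seconds and a list of item responses per participant), and the default thresholds as parameters are hypothetical; only the thresholds themselves and the exclusion counts come from the text.

```python
from itertools import groupby


def longest_run(responses):
    """Length of the longest streak of identical consecutive responses."""
    return max((len(list(group)) for _, group in groupby(responses)), default=0)


def is_too_fast(total_seconds, n_items, min_seconds_per_item=2.0):
    """Flag completion faster than two seconds per item."""
    return total_seconds / n_items < min_seconds_per_item


def is_long_string(responses, max_run=6):
    """Flag more than six identical successive responses."""
    return longest_run(responses) > max_run


# Sample flow using the counts reported in the text.
completed = 1652
exclusions = {
    "not currently employed": 92,
    "no permission to use data": 33,
    "overly swift completion": 78,
    "long-string responding": 88,
}
final_n = completed - sum(exclusions.values())
print(final_n)
```

The arithmetic reproduces the reported final sample size, which suggests that the four exclusion counts refer to non-overlapping groups of participants.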

Materials
We used the original eight-item G-SJS (Kay & Jost, 2003) for the British sample, the German translation by Ullrich and Cohrs (2007), and the French translation used in Langer et al. (2020), which we received from P. Vasilopoulos (personal communication, January 17, 2020) to measure system justification.Items were rated on a nine-point scale ranging from 1 = "Do not agree at all" to 9 = "Agree completely" (see SOM, Table S1 for a list of all items).
To measure political orientation, we used a single item that had been used previously in all three languages, which read: "In politics, people sometimes talk of 'left' and 'right'. Where would you place yourself on this scale, where 1 means the left and 11 means the right?" (Breyer, 2015; European Social Survey Round 2019; Kroh, 2007).
To measure willingness to strike, we developed three items based on the study by Akkerman et al. (2013), which were rated on a five-point scale ranging from 1 = "Not at all" to 5 = "Very likely."The items were "I would strike for more money", "I would strike for better working conditions", and "I would strike for better working hours".Following recommended procedures (e.g., Schaffer & Riordan, 2003), we translated this scale from German into English and French via a translation-backtranslation process by two individuals per language who were fluent in both German and English or German and French, respectively.Any differences found between the original items and the back-translated versions were discussed, and agreement was reached on the most appropriate translation.
Table 1 shows descriptive statistics and reliability scores in terms of Cronbach's α and McDonald's ω (McDonald, 1999) for the G-SJS, willingness to strike, and political orientation for all three samples.

Procedure
At the beginning, participants chose their preferred language. After that, a welcome page explained the purpose of the study. This page was followed by socio-demographic questions. Here, participants who indicated that they were not currently employed were screened out. Next, all other participants filled out various scales including the G-SJS (Kay & Jost, 2003) and several others not of interest for present purposes. At the end, the participants were asked whether their data could be used for scientific purposes, were thanked for their participation, and returned to the online panel provider site to receive their compensation.
To evaluate ME, the standard three-step process using multigroup CFAs was followed (see also the SOM, S1 for further explanations on statistical analyses of ME; Hirschfeld & von Brachel, 2014; Vandenberg & Lance, 2000). As Δχ² highly depends on sample size, ΔCFI was used to compare models (Cheung & Rensvold, 2002): The equivalence hypothesis was rejected if the CFI decreased by .01 or more (i.e., ΔCFI ≤ −.01) between the less constrained model and the tested model.
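The ΔCFI decision rule can be written down as a short sketch. The function names are hypothetical, the first pair of CFI values is purely illustrative, and the three ΔCFI values in the loop are those reported in the Study 1 results; rounding to three decimals is an assumption made here so that floating-point noise does not affect the comparison with the cutoff.

```python
def delta_cfi(cfi_constrained, cfi_less_constrained):
    """Change in CFI when moving to the more constrained model."""
    return round(cfi_constrained - cfi_less_constrained, 3)


def rejects_equivalence(d_cfi, cutoff=-0.01):
    """True if CFI drops by .01 or more, i.e. equivalence is rejected."""
    return round(d_cfi, 3) <= cutoff


# Hypothetical CFI values for two nested models (not taken from the study).
example = delta_cfi(cfi_constrained=0.950, cfi_less_constrained=0.962)
print(example, rejects_equivalence(example))

# Delta-CFI values reported in the Study 1 results.
for reported in (-0.011, -0.004, -0.039):
    print(reported, rejects_equivalence(reported))
```

Under this rule, the reported values of −.011 and −.039 lead to rejection, whereas −.004 does not.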
TABLE 1 Descriptive statistics and internal consistencies of the general system justification scale and willingness to strike scale for the three samples (N_UK = 444, N_DE = 454, and N_FR = 463)

Statistical analyses: measurement equivalence effect sizes
ME analyses can be supplemented by effect sizes in order to assess the magnitude and practical importance of non-equivalence (Nye & Drasgow, 2011). We calculated so-called impact on the scale level and dMACS on the item level, as suggested by Nye and Drasgow (2011). In this study, we focused on the scale-level effect size of impact; all analyses regarding dMACS can be found in the SOM (S3) as well as a more detailed description on how to calculate impact (S1). On the scale level, impact can be calculated as an unstandardised effect size to assess the magnitude of non-equivalence. Impact reflects the construct-relevant differences in the measure (Ock et al., 2020). In order to estimate impact, one must first calculate Δmean. Δmean reflects the amount of the observed difference in mean composite scores between the assessed groups that can be attributed to non-equivalence. To estimate Δmean, a researcher sums up the differences in item means that can be attributed to non-equivalence between the referent and focal groups (Nye & Drasgow, 2011). Negative Δmean values indicate that differential item functioning (DIF, i.e., whether the items work in the same way in the groups, Janssen, 2011) results in higher means for the focal group than the referent group. Impact is then estimated as the observed difference between the referent group and the focal group minus Δmean.
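The relation between the observed difference, Δmean, and impact can be illustrated with a minimal sketch. All numbers below are hypothetical and chosen only for illustration; the function names are likewise assumptions, and the item-level DIF differences are taken as given rather than estimated from a fitted model as they would be in practice.

```python
def delta_mean(item_dif_differences):
    """Sum of item-mean differences (referent minus focal) attributable to DIF."""
    return sum(item_dif_differences)


def impact(observed_difference, d_mean):
    """Construct-relevant difference: observed difference minus the DIF share."""
    return observed_difference - d_mean


# Hypothetical item-level differences attributable to non-equivalence
# for five items (the referent item contributes no DIF by definition).
item_dif = [0.10, -0.20, 0.05, 0.15, 0.00]

# Hypothetical observed difference in composite means (referent minus focal).
observed = 0.40

d_mean = delta_mean(item_dif)
print(round(d_mean, 2), round(impact(observed, d_mean), 2))
```

In this hypothetical case, a positive Δmean means that DIF raised the referent group's mean, so the construct-relevant difference is smaller than the observed one.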

STUDY 1 - RESULTS
Preliminary analyses can be found in the SOM (S2). Based on these, the reverse-coded items "British/French/German society needs to be radically restructured" (Item 3) and "Our society is getting worse every year" (Item 7) were excluded from the following analyses because the CFAs of all three samples showed a barely acceptable fit when these items were included.
Before concluding that this should be interpreted as evidence against metric equivalence, we inspected the modification indices of the metric equivalence model and tested for partial metric equivalence by sequentially releasing individual loading constraints and retesting the model (Vandenberg & Lance, 2000). With respect to metric equivalence, releasing the constraints for items means that all items are still required to load onto the same factors in each sample, but the requirement that the loadings be of the same magnitude across samples can be dropped for some items (Vandenberg & Lance, 2000). The item "In general, the British/German/French political system operates as it should" (Item 2) had the highest modification indices. Thus, we relaxed the constraints on this item and tested again for metric equivalence. This model showed a little improvement (ΔCFI = −.011). As the change in CFI still exceeded the cutoff of −.01 in magnitude, we additionally released the constraints on the item "Most policies serve the greater good" (Item 5). Releasing the constraints for Item 2 and Item 5 improved the model fit, bringing the change in CFI within the cutoff (ΔCFI = −.004). Following Steenkamp and Baumgartner (1998) and Vandenberg and Lance (2000), a factor can be assumed to be partially equivalent if more than half of the items loading onto the factor are equivalent. Hence, as the constraints for two of the six items were released, partial metric equivalence for the one-factor model was found.
Finally, we tested for scalar equivalence. Scalar equivalence implies that the item intercepts are also similar across groups. Hence, there should be no systematic response biases. This form of ME is necessary in order to meaningfully compare the means of the latent variables across different groups (Chen, 2008). The scalar equivalence model showed just acceptable fit, χ²(39) = 342.29, p < .001, CFI = .906, TLI = .89, RMSEA = .13, 90% CI [.12, .14], SRMR = .07. Compared with the partial metric model, the change in CFI equalled ΔCFI = −.039, exceeding the threshold of ΔCFI = −.01 and indicating that the scalar model had worse fit than the partial metric model (Cheung & Rensvold, 2002). A further release of item constraints to test for partial scalar equivalence was not possible, as two of the six remaining items had already been freed, and releasing a third item would have led to half of the items being freed, which Steenkamp and Baumgartner (1998) and Vandenberg and Lance (2000) do not recommend. Thus, scalar measurement equivalence could not be established.
Taken together, we found partial metric equivalence, but no scalar equivalence for the scale in the three countries. Hence, we can meaningfully compare difference scores on the items across these three countries (Steenkamp & Baumgartner, 1998), but comparing means across the three countries is not warranted (but see the section below on effect sizes of non-equivalence). Note that these results pertain to the already reduced scale after excluding two of eight items that led to poor model fit in preliminary analyses.
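The "more than half of the items" criterion for partial equivalence amounts to a simple count. The sketch below uses a hypothetical function name and is only meant to make the stopping point in the analysis explicit; it is not part of the authors' analysis code.

```python
def partial_equivalence_tenable(n_items, n_freed):
    """More than half of the items must keep their equality constraints."""
    n_constrained = n_items - n_freed
    return n_constrained > n_items / 2


# Six items remained after excluding Items 3 and 7.
print(partial_equivalence_tenable(n_items=6, n_freed=2))  # Items 2 and 5 freed
print(partial_equivalence_tenable(n_items=6, n_freed=3))  # one more item freed
```

With two of six items freed, four items remain constrained and the criterion is met; freeing a third item would leave exactly half constrained, which no longer satisfies it.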

Effect sizes of measurement non-equivalence
To calculate the effect sizes of measurement non-equivalence, we first chose Great Britain as the referent group for comparing the samples from Great Britain and Germany and Great Britain and France because the original English version of the scale was used in Great Britain (following Nye & Drasgow, 2011). Next, we chose Germany as the referent group for the German-French comparison. The item "Society is set up so that people usually get what they deserve" (Item 8) was chosen as the referent item, following suggestions on how to choose a referent item by Nye and Drasgow (2011). These criteria state that n-1 tests of equivalence are conducted for each of the items in the scale. The only thing that varies across these tests is the item serving as the referent in each test. An item is chosen as referent item if this item is equivalent across each of the n-1 tests of equivalence. This resulted in three comparisons of five items; hence, fifteen item comparisons in total.
On the scale level, the impact for the British-German comparison was −0.83; this refers to the true construct-relevant difference between these two samples and indicates that the German sample has higher system justification tendencies than the British sample. The impact was larger in magnitude than the observed difference between these two groups (−0.55), indicating that non-equivalence inflated the British scores. The amount of observed difference that can be attributed to DIF was, hence, Δmean = 0.28 for this comparison (see Table 2). Δmean was in the opposite direction of the impact (impact = −0.83), indicating that non-equivalence reduced the observed difference in system justification relative to the true mean difference, thus resulting in an observed difference that was smaller than the actual construct-relevant difference (i.e., impact). Results regarding the item-level effect sizes of measurement non-equivalence can be found in the SOM (S3 and Table S2).
For the British-French comparison, the impact was 2.09. Hence, the British sample had higher system justification tendencies than the French sample. The observed difference between these two groups was 0.99, suggesting that the true difference between these two samples was actually larger than the observed difference indicated. The amount of the observed difference attributable to DIF was Δmean = −1.10. The negative value indicates that DIF resulted in a higher mean in the French sample than the British sample. Moreover, the Δmean for this comparison was negative, whereas the observed difference was positive (0.99). Hence, non-equivalence reduced the differences between the two samples by inflating the French scores.
For the German-French comparison, the impact was 2.91, indicating that the German sample had higher system justification tendencies than the French sample. The observed difference was 1.53, and the amount of the observed difference that could be attributed to DIF was Δmean = −1.39. This indicates that the DIF resulted in higher item means in the French sample than the German sample. As in the British-French comparison, the Δmean was negative, while the observed mean difference was positive. Thus, the true difference between these two countries was actually larger than the observed difference indicated, which is shown by impact = 2.91. Hence, the non-equivalence reduced the differences between the two samples by inflating the scores in the French sample.
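As a worked check, the three reported comparisons can be run through the relation impact = observed difference − Δmean, together with the share of the observed difference attributable to DIF (|Δmean| relative to |observed difference|). The sketch below uses only the rounded values reported above; the dictionary layout and function names are assumptions, and small deviations from the reported impact are to be expected because the inputs are rounded to two decimals.

```python
reported = {
    "Britain-Germany": {"observed": -0.55, "delta_mean": 0.28, "impact": -0.83},
    "Britain-France": {"observed": 0.99, "delta_mean": -1.10, "impact": 2.09},
    "Germany-France": {"observed": 1.53, "delta_mean": -1.39, "impact": 2.91},
}


def recomputed_impact(observed, delta_mean):
    """Observed difference minus the part attributable to DIF."""
    return observed - delta_mean


def dif_share_percent(observed, delta_mean):
    """Size of the DIF effect relative to the observed difference, in percent."""
    return abs(delta_mean) / abs(observed) * 100


for name, values in reported.items():
    imp = recomputed_impact(values["observed"], values["delta_mean"])
    share = dif_share_percent(values["observed"], values["delta_mean"])
    print(f"{name}: impact = {imp:.2f}, DIF share = {share:.0f}%")
```

A share above 100%, as in the British-French comparison, means that the DIF effect is larger in absolute size than the observed mean difference itself.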

Results RQ2: convergent validity of the G-SJS
To answer RQ2a, that is, whether the associations between system justification and political orientation (i.e., conservatism) in the three countries could be replicated, correlations between political orientation and system justification scores were calculated for each sample. The system justification score was calculated without Items 3 and 7 as these two items had been excluded after the CFAs. Table 3 shows the results. System justification was significantly positively correlated with a right-wing political orientation in the British sample (replicating previous results). The same relation between these two constructs was also found in our French sample (contrary to previous results), but it was less than half as strong and failed to pass the significance threshold of p < .05. In the German sample, the correlation between system justification and political orientation was descriptively negative (contrary to previous results) and also not significant. Thus, the RQ2a results were inconclusive.
RQ2b concerned the relationship between system justification and willingness to strike. For the French sample, a significant negative correlation was found (Table 3). The British and the German samples exhibited considerably smaller, non-significant negative relationships, although the correlation in the German sample was close to significance. Hence, RQ2b can be answered by stating that system justification was negatively related to willingness to strike in the French, but not in the British and German samples.

STUDY 1 - DISCUSSION
When comparing all three groups, partial metric equivalence was reached but not full scalar equivalence. When considering two-country comparisons, we found partial scalar equivalence for the comparisons between Great Britain and France and Great Britain and Germany, respectively. Hence, in these pairings, comparisons of mean values were allowed; the comparison of mean values between Germany and France was not justified. Note, however, that we needed to exclude two reverse-coded items from the scale that negatively impacted baseline model fit and to additionally free several items from their constraints to achieve this partial scalar equivalence. In addition, the impact (i.e., the effect of non-equivalence on the scale level) of the comparisons between the French sample and the other two samples indicated that the observed scores were influenced by non-equivalence. The percentage of observed mean difference attributable to DIF ranged from 51% (UK-German comparison) to 111% (UK-French comparison). A percentage of 111% suggests that the effects of DIF are larger than the observed mean differences in this case. For the British-French comparison, Δmean was negative while the observed mean difference was positive. This shows that the referent group (the British sample) had a higher observed mean, but the effects of DIF lowered this effect by increasing the mean of the focal group (the French sample). Thus, the true difference was larger than the observed differences indicated, and conclusions based on the observed descriptive statistics might be affected. This also applies to the other two comparisons, as the 50% and 90% of observed mean difference attributable to DIF point to substantial influences of DIF on the scale values. On the item level, several items were considered problematic (see Table S3 in the SOM for further information).
The relation between system justification and political orientation emerged only partly as expected (with higher political orientation scores indicating greater conservatism). Previous research found positive correlations in Great Britain (Zmigrod et al., 2018) and Germany (Jost, 2019) and a negative correlation in France (Langer et al., 2020). In the present study, system justification was significantly positively correlated with political orientation in Great Britain, non-significantly negatively correlated in Germany, and non-significantly positively correlated in France. Thus, these results differ from previous findings for two of the three countries and are inconsistent with the assumption that conservatism is a "prototypical system-justifying ideology" (Jost et al., 2003, p. 63).
Perhaps the German participants perceived the status quo in their country as already rather left/liberal based on policies such as the rather pro-refugee political agenda of the German federal government at the time. Thus, German conservatives might have reported relatively low system justification motivation and been motivated to challenge the status quo, for example, by changing over to a more restrictive refugee policy. Then again, Germany had been led by a conservative chancellor and party for 15 years at the time of data collection, speaking against the idea that Germans perceived their country as particularly leftist. Thus, there is no easy explanation for the descriptively negative correlation between system justification and political ideology in Germany.
The reasoning that the current status quo is perceived as rather left/liberal aligns with Langer et al.'s (2020) arguments with respect to France as these researchers also found a negative correlation between system justification and conservatism in their French sample. However, the French participants in our study did not follow this pattern. Additionally, some research has found a negative quadratic relationship between system justification and political conservatism in a European setting (Caricati, 2019). Hence, further research is needed to investigate the relationship between political conservatism and system justification as the association might not be as straightforward as previously assumed.
The relation between system justification and willingness to strike as a proxy for participation in collective action was negative in all three countries, as expected. However, these correlations were small and not significant in Great Britain and Germany. Thus, these findings only partially align with the assumption that people who are less system-justifying are less willing to accept inequality and more willing to participate in collective action (Jost et al., 2017).

STUDY 2 - METHODS
In Study 2, we broadened the scope of our analyses to ensure that our results were valid and not due to peculiarities of Study 1. To this end, we used the dataset from Brandt et al. (2020), which contains 66 samples from 30 countries. The dataset is available at https://osf.io/qw47m/. Note that in Study 2, we used a different method to assess ME: the alignment optimisation method (Asparouhov & Muthén, 2014; Magraw-Mickelson et al., 2020; Marsh et al., 2018). This method is better suited for comparisons of more than three samples. On the downside, it does not allow for calculating ME effect sizes.¹

Sample
A total of 14,469 persons were included in the dataset, and of these, only those who had no missing values in the G-SJS scale (N = 13,494) were included in our analyses. The mean age was 25.05 (SD = 10.48), and 66.6% of the participants were female. See Brandt et al. (2020) for a further description of their studies (but note that some figures might differ, as we did not exclude participants who had missing values in constructs other than system justification).
Brandt et al. (2020) used the original eight-item G-SJS (Kay & Jost, 2003) and their own translations to measure system justification. Items were rated on a self-developed seven-point response scale ranging from −3 = "Disagree strongly" to +3 = "Agree strongly". Note that this response scale differed from the one used in the original scale (and in Study 1), which ranged from 1 = "Do not agree at all" to 9 = "Agree completely".
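The sample selection and the two response formats described above can be illustrated with a minimal sketch. It assumes that the negatively worded Items 3 and 7 are reverse-coded by reflecting each response around the scale midpoint; the function names and the two example respondents are hypothetical and are not taken from Brandt et al. (2020) or from our analysis scripts.

```python
from typing import List, Optional, Sequence

# 1-based positions of the negatively worded G-SJS items (Items 3 and 7).
REVERSED_ITEMS = (3, 7)


def reverse_code(value: int, low: int, high: int) -> int:
    """Reflect a response around the midpoint of a scale running from low to high."""
    return low + high - value


def prepare_respondent(
    responses: Sequence[Optional[int]], low: int, high: int
) -> Optional[List[int]]:
    """Return recoded responses, or None if any G-SJS item is missing."""
    if any(r is None for r in responses):
        return None  # exclusion is based on the G-SJS items only
    return [
        reverse_code(r, low, high) if position in REVERSED_ITEMS else r
        for position, r in enumerate(responses, start=1)
    ]


# Two hypothetical respondents on the -3..+3 scale used in Study 2.
raw = [
    [2, 1, -3, 0, 1, 2, -2, 1],    # complete
    [1, None, 0, 0, 1, 1, 0, 2],   # missing Item 2, so excluded
]
prepared = [
    p for p in (prepare_respondent(r, -3, 3) for r in raw) if p is not None
]
```

On the −3 to +3 scale the reflection reduces to a sign change, whereas on the 1 to 9 scale of Study 1 it corresponds to 10 minus the response, which is why the same helper takes the scale endpoints as arguments.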

¹ We also used the alignment method for the data from Study 1 to ensure that the results were not confounded by the statistical method used. We obtained metric, but not scalar, equivalence for the 8-item version as well as the 6-item version. The degree of non-equivalence across parameters was 45.8% of intercepts (8-item version) and 44.4% of intercepts (6-item version), respectively. Thus, the results corroborated those reported in Study 1 using the MG-CFA procedure.
We used the alignment optimisation method (Asparouhov & Muthén, 2014; Magraw-Mickelson et al., 2020; Marsh et al., 2018) to assess ME because it is well suited for comparisons of more than three samples. We used the sirt package (Robitzsch, 2020) to conduct the alignment optimisation in R and followed the recommendations of Fischer and Karl (2019) regarding the settings of tolerance and alignment power. This method assumes approximate rather than exact invariance. Alignment optimisation starts with a common configural model that contains all groups (Magraw-Mickelson et al., 2020). In this configural model, the intercepts and loadings are unconstrained; unlike in MG-CFA, no separate baseline model is fitted for each group. Starting from this configural model, the process uses maximum likelihood (ML) estimation to fit an optimal set of measurement parameters and then computes approximations based on that. Based on the optimal model, the means and variances of the latent variable can be computed.
The alignment method is executed in two steps. First, a configural model representing the best-fitting model among all groups is established while fixing the factor means to 0 and the factor variances to 1, without constraining the loadings or intercepts (Magraw-Mickelson et al., 2020). Thus, alignment optimisation works with the assumption that there is a degree of non-invariance, and its goal is to keep this non-invariance to a minimum. Second, the factor means and variances are freely estimated and undergo an optimisation process for every group factor mean and item parameter (Asparouhov & Muthén, 2014). Once the minimisation point has been reached, factor means and factor variances can be compared across groups, and a "post-estimation algorithm" identifies each model parameter (e.g., a loading or an intercept) that differs significantly from the average of that parameter across all groups. Estimates that are significantly different are flagged as non-invariant. The output then provides the latent means based on this model plus the parameters flagged as non-invariant (Asparouhov & Muthén, 2014; Magraw-Mickelson et al., 2020). Thus, the alignment process allows for the estimation of reliable means despite the presence of some measurement non-invariance. As a threshold, Asparouhov and Muthén (2014) recommended 20% non-invariant parameters as acceptable.
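To make the second step more concrete, the following sketch evaluates (but does not minimise) an alignment loss of the general form described by Asparouhov and Muthén (2014), as we understand it. Configural loadings and intercepts are rescaled by candidate group means and variances, and a component loss function penalises pairwise differences between groups. The value of eps, the weights based on group sizes, and all toy numbers are illustrative assumptions; the sketch does not reproduce the sirt implementation or the settings used in our analyses.

```python
import math
from itertools import combinations


def aligned_parameters(loadings, intercepts, mean, variance):
    """Rescale configural parameters (factor mean 0, variance 1) for one group."""
    sd = math.sqrt(variance)
    new_loadings = [lam / sd for lam in loadings]
    new_intercepts = [nu - mean * lam / sd for lam, nu in zip(loadings, intercepts)]
    return new_loadings, new_intercepts


def component_loss(x, eps=0.01):
    """sqrt(sqrt(x^2 + eps)): favours few large over many medium-sized differences."""
    return (x * x + eps) ** 0.25


def alignment_loss(config, means, variances, sizes):
    """Sum the weighted component losses over all group pairs and item parameters.

    config: one (loadings, intercepts) tuple per group from the configural model.
    """
    aligned = [
        aligned_parameters(lam, nu, a, v)
        for (lam, nu), a, v in zip(config, means, variances)
    ]
    total = 0.0
    for g1, g2 in combinations(range(len(config)), 2):
        weight = math.sqrt(sizes[g1] * sizes[g2])
        for kind in (0, 1):  # 0 = loadings, 1 = intercepts
            for p1, p2 in zip(aligned[g1][kind], aligned[g2][kind]):
                total += weight * component_loss(p1 - p2)
    return total


# Hypothetical two-item, two-group example: the groups share loadings, and the
# intercept gap is fully explained by a latent mean difference of 0.5.
config = [([1.0, 0.8], [0.0, 0.0]), ([1.0, 0.8], [0.5, 0.4])]
sizes = [100, 100]
loss_equal_means = alignment_loss(config, [0.0, 0.0], [1.0, 1.0], sizes)
loss_shifted_mean = alignment_loss(config, [0.0, 0.5], [1.0, 1.0], sizes)
```

In this toy case the loss is lower when the second group's mean is set to 0.5, because the aligned intercepts then coincide across groups. An actual alignment run searches over all group means and variances for the values that minimise this kind of function.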

STUDY 2 - RESULTS
Preliminary analyses regarding the CFAs for every sample of the dataset can be found in the SOM (Table S4). We found metric equivalence for the overall sample, but no scalar equivalence. The degree of non-equivalence across parameters was 55.4% of intercepts (which are crucial for achieving scalar equivalence) and hence above the recommended threshold of 20% (Asparouhov & Muthén, 2014). Based on the results of Study 1, we also conducted the analyses without the recoded Items 3 and 7. Again, metric equivalence was established, but no scalar equivalence, and the percentage of non-equivalent item parameters was 52.8% of intercepts. Further details and the calculations are available in the SOM (S1 and Table S4).
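The comparison against the 20% threshold amounts to simple arithmetic, sketched below. The percentages are those reported in the text; the flag matrix is a hypothetical toy example (three groups by four items) that only shows how such a percentage is obtained from parameters flagged as non-invariant.

```python
THRESHOLD = 20.0  # percent of non-invariant parameters, as cited in the text


def percent_noninvariant(flags):
    """Share (in percent) of parameters flagged as non-invariant.

    flags: one row per group, one boolean per item parameter.
    """
    flat = [flag for row in flags for flag in row]
    return 100.0 * sum(flat) / len(flat)


def exceeds_threshold(percent, threshold=THRESHOLD):
    """True if the degree of non-invariance is above the acceptable level."""
    return percent > threshold


# Percentages of non-invariant intercepts reported for Study 2.
reported = {"8-item version": 55.4, "6-item version": 52.8}
verdicts = {label: exceeds_threshold(p) for label, p in reported.items()}

# Hypothetical flag matrix: 2 of 12 intercepts flagged.
toy_flags = [
    [True, False, False, False],
    [False, True, False, False],
    [False, False, False, False],
]
toy_percent = percent_noninvariant(toy_flags)
```

Both reported values lie well above the threshold, whereas the toy matrix (2 of 12 flagged) would fall below it.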

STUDY 2 - DISCUSSION
Using the alignment optimisation method, metric, but not scalar, equivalence was established for the 30 countries in the dataset from Brandt et al. (2020), where a modified response scale was used. These findings are consistent with those from Study 1. They indicate that the validity of comparisons of G-SJS means across the assessed countries is questionable with the current version of the scale.

GENERAL DISCUSSION
The current article investigated the applicability of the G-SJS (Kay & Jost, 2003) in different countries by assessing measurement equivalence in samples from Great Britain, Germany, and France in Study 1 and across 30 countries in Study 2, as well as its convergent validity in Study 1. In Study 1, the one-factor model exhibited good fit in all three samples after excluding the two reverse-coded items, and the G-SJS was partially metric equivalent across the three countries. Partial metric equivalence allows for meaningful comparisons of difference scores on the items across the three countries (for an overview of practical questions regarding ME, see Table 4; Steenkamp & Baumgartner, 1998). We did not find scalar equivalence across the three countries, indicating that the construct mean values in the three countries cannot be readily compared. Focusing on bilateral comparisons allowed us to calculate effect sizes of measurement non-equivalence. For some comparisons, these were substantial.
In Study 2, we also found metric, but not scalar, equivalence across the 30 countries using an analysis method well suited for comparisons of larger numbers of samples (Asparouhov & Muthén, 2014). Note that this method did not allow us to estimate effect sizes of non-equivalence. The convergent validity analyses in Study 1 were rather inconclusive, as the relation between system justification and political orientation was significant and in the expected direction only in the British sample. Furthermore, the relationship between system justification and willingness to strike, although negative in all three samples, was significant only in the French sample.
What are the substantive implications of the present research for researchers who are not specifically interested in measurement issues but want to use the scale in their research? Based on the results of our two studies, we do not recommend calculating an ANOVA to compare scale means of several countries. Due to the lack of scalar equivalence in the present samples, these comparisons might be distorted by construct-irrelevant variance in other samples as well. Hence, comparisons such as those made by van der Toorn et al. (2010) between a U.S. and a Hungarian sample should be interpreted with caution. For such comparisons, ME should first be established. The same applies to comparisons that are made between capitalist and post-Communist societies (Cichocka & Jost, 2014) and to the multilevel analyses comparing the relations of subjective status and perceived legitimacy across countries as in Brandt et al. (2020). If measurement invariance is not established before interpreting the results, one cannot be sure whether the observed differences are (at least partly) based on improper translations or different interpretations of items across groups. The same applies to the comparison of the scale's correlations with external criteria across countries. This is why we did not compare the correlations of system justification with political orientation and willingness to strike across the three countries in Study 1. Instead, we focused on interpreting each correlation in isolation. To make these implications easily accessible, Table 4 summarises pertinent questions for further use of the scale along with answers based on the results of the present studies.

TABLE 4 Questions and answers regarding further use of the general system justification scale

Question: May I calculate an ANOVA for the scale means of several countries?
Answer: The lack of scalar equivalence does not permit an ANOVA to be conducted. Comparisons would be distorted by this construct-irrelevant variance.

Question: May I compare the scale's correlations with external criteria across countries?
Answer: As scalar equivalence is also a prerequisite for interpreting differences in variances, the answer is no. By differences in variances, we mean the variance in mean G-SJS values within a given sample, which in turn enters the correlation.

Question: May I assume the same factor structure of the scale in different countries?
Answer: The one-factor model exhibited good fit in all three countries in Study 1. In this respect, the answer is yes. Nevertheless, the items used to assess this factor do not seem well suited for every country (which can be seen in Table S4). Future research should always start with a CFA for each sample, followed by a test of measurement equivalence via multigroup confirmatory factor analysis or alignment optimisation.

Question: May I assume that all items of the scale are similarly important in different countries?
Answer: First, the negatively worded Items 3 and 7 should be improved, as they impaired the model fit in all three samples in Study 1. However, in Study 2, they did not have a large impact on the non-equivalence. Second, the other items showed different effect sizes of non-equivalence in the three comparisons in Study 1; therefore, which items are similarly interpreted and which are not seems to depend on the specific samples in question. This was also found in Study 2, where Item 4 had the most unique item parameters, followed by Items 6 and 3.

Our statistical results also align with previous criticism of the scale from a theoretical standpoint: Owuamalam et al. (2019) argued that some items measure national attachment rather than system justification, which is a theoretically different construct. National attachment has also been shown to vary between countries (Becker et al., 2017). Hence, differences in these items might not be based on differences in system justification, but rather on differences in national attachment to the respective group. In accordance with this, we found that the item "The United Kingdom/Germany/France is the best country in the world to live in" exhibited the most unique item parameters in Study 2 and should, thus, be considered for further improvement. Hence, the earlier theoretical critique of this type of item and our empirical results independently led to similar conclusions. These should be considered for the further development of the scale.
Another point regarding the items is that we had to exclude the two negatively worded items before conducting the MG-CFAs in Study 1 to achieve a satisfactory model fit in all three countries. This highlights the common problem in ME testing that negatively worded items are often interpreted differently across countries, which might distort the intended unidimensionality of a scale by creating a separate factor (e.g., Herche & Engelland, 1996; Lindwall et al., 2012; D. P. Schmitt & Allik, 2005). Based on Study 1, the two negatively worded items should be considered for further improvement or discarded from the scale altogether (both alternatives would, admittedly, require new validation work; Flake et al., 2017). Note, however, that these negatively worded items did not pose particular problems in Study 2.
Not only the statistical but also the practical impact of ME findings (i.e., effect sizes) should be considered (N. Schmitt & Ali, 2014), which was possible in Study 1. The effect sizes showed that the non-equivalence on the scale level between France and the other two countries was particularly substantial. For example, the non-equivalence resulted in item means that were 1.39 standard deviations higher in the French sample than in the German sample. Hence, changes to the scale items might be required before the scale can be used for valid cross-country comparisons.
Finally, the aim of the present research was to elucidate that considering ME is an important issue when working with the G-SJS (Kay & Jost, 2003). To this end, we conducted cross-country comparisons of system justification as measured with the G-SJS. However, the implications of our findings go well beyond cross-country comparisons. To be sure, such comparisons comprise only a small part of research on system justification. Importantly, the same logic applies to all kinds of comparisons of system justification tendencies across groups as measured with the G-SJS, within the same country or across countries (e.g., different ethnic groups, high- and low-status groups; Jost et al., 2003). Thus, future research should also assess whether the G-SJS is measurement equivalent in such comparisons and not only across countries.

Limitations and future research
Several limitations need to be mentioned. First, although the samples in Study 1 were comparatively large (Ns > 440), they had other limitations. This is evident from the fact that a considerable number of participants had to be excluded, for reasons such as overly fast completion or selecting identical response options across a large number of consecutive items. By excluding these participants, we tried to enhance the quality of our samples and ensure that our analyses are valid. Nevertheless, it might be useful to further validate our results in other, ideally representative, samples for each country. Second, Brandt et al. (2020) used a different response scale than the original G-SJS and the one we used in Study 1. Despite the converging evidence in terms of ME, the results might be affected by this difference between studies and also by the difference between the analysis methods used. To address this issue, we conducted the analyses from Study 1 with both the MG-CFA and the alignment method and obtained comparable results, suggesting that this should not be a major problem (see SOM, S4).
Future research on the G-SJS (Kay & Jost, 2003) and system justification motivation, in general, could address several areas. In particular, the results from the German sample (i.e., the negative relation between system justification and conservatism) should be replicated. If found again, researchers should investigate the reasons for this unexpected relation. It would also be interesting to assess whether this negative relationship emerges in additional countries, particularly countries that are known to have a rather left/liberal status quo. Finally, further research could also investigate the differences we found between Germany and France in more depth to disentangle whether these differences are due to a lack of ME (i.e., different item formulations or item understandings) or to real differences in the conceptualisation of system justification.

CONCLUSION
The main aim of the present research was to assess the measurement equivalence of the G-SJS (Kay & Jost, 2003). Based on two studies (Study 1: three countries; Study 2: 30 countries), our results indicate that it is not justified to compare mean values across countries; in fact, the scale-level effects of non-equivalence were quite large. Thus, caution is required when comparing samples from different countries using the G-SJS in its current form.

ACKNOWLEDGMENT
Open access funding enabled and organized by ProjektDEAL.

CONFLICT OF INTEREST
We have no conflict of interests to disclose. We would like to thank Keri Hartman for proof-reading this document.

OPEN RESEARCH BADGES
This article has earned an Open Data badge for making publicly available the digitally-shareable data necessary to reproduce the reported results. The data is available at https://osf.io/34p9j/?view_only=82053bc8f56e443bbab03bc92c14077f.
TABLE 2 Effects of non-equivalence on scale-level properties
Note: Items were z-standardised. Impact refers to the true differences in the construct. Negative values in Δmean indicate that DIF (differential item functioning) results in higher means for the focal group than the referent group. UK = Great Britain, DE = Germany, FR = France. a Referent group.

TABLE 3 Correlations between system justification, political orientation, and willingness to strike in the three samples
Note: N UK = 444, N Germany = 454, N France = 463. UK = Great Britain, DE = Germany, FR = France. Higher values in political orientation correspond to a more conservative political ideology. The superscripted numbers represent the significance level of the corresponding correlations.