Demographic Inference and Representative Population Estimates from Multilingual Social Media Data

Social media provide access to behavioural data at an unprecedented scale and granularity. However, using these data to understand phenomena in a broader population is difficult due to their non-representativeness and the bias of statistical inference tools towards dominant languages and groups. While demographic attribute inference could be used to mitigate such bias, current techniques are almost entirely monolingual and fail to work in a global environment. We address these challenges by combining multilingual demographic inference with post-stratification to create a more representative population sample. To learn demographic attributes, we create a new multimodal deep neural architecture for joint classification of age, gender, and organization-status of social media users that operates in 32 languages. This method substantially outperforms current state of the art while also reducing algorithmic bias. To correct for sampling biases, we propose fully interpretable multilevel regression methods that estimate inclusion probabilities from inferred joint population counts and ground-truth population counts. In a large experiment over multilingual heterogeneous European regions, we show that our demographic inference and bias correction together allow for more accurate estimates of populations and make a significant step towards representative social sensing in downstream applications with multilingual social media.

Figure 1: Our two-stage approach debiases non-representative Twitter data by (i) inferring demographics with state-of-the-art accuracy from multilingual, profile-only data and (ii) learning inclusion probabilities to create more representative samples for Europe-wide subregions.

INTRODUCTION
Data representing the attitudes and behaviors of a (national) base population is of great importance to policy making, social science research, and commercial prediction tasks, but representative surveys are expensive and infrequent. Social media data has been proposed as a real-time and inexpensive way to measure social phenomena; this area of research is known as "social sensing." Social media data has been used, with various levels of success, in areas such as infectious disease [30,31,51], migration and tourism [7,50,81], and box office takings for films [21,55].
However, despite being easily accessed, large in scale, and detailed, social media data is generally not representative of indicator variables measured in any broader (offline) population due to various biases [43,67]. The self-selection of users joining and participating on any given social media platform is an especially large source of error. In the UK, for instance, Twitter users are more likely to be male and young compared to the national population [71], and in the U.S. men and residents of densely populated areas are overrepresented [35], while there exists a mix of over- and undersampling on Twitter for users with specific racial backgrounds [56]. In general, however, the exact inclusion probabilities for a specific demographic group on any platform in a given country or region are unknown. And while not all social sensing tasks require representative data [3], recent studies have pointed out the fallacies in predicting phenomena via social sensing without controlling for sampling biases in social media data [28,29,44], with research applications aiming to draw inferences for nation-wide target populations often relying on such attribute data. Consequently, a systematic methodology for estimating these inclusion probabilities is needed.
Survey analysis researchers have dealt with non-representative polls and non-response through sample re-weighting, with post-stratification being a well-known technique [10,37]. These methods make use of basic demographic attributes, most prominently age, gender, and location [60,77]. Where demographic details are provided, post-stratification has been a valuable technique [77,81]. However, most social media platforms do not provide demographic data about their users, making it difficult to know or correct for such biases. Although promising techniques for extracting demographic attributes from social media have been proposed, there exists no robust multilingual approach for this task. We address these challenges by inferring basic demographic attributes and correcting for selection biases on a large sample of multilingual profile data from Twitter (Figure 1), one of the most used platforms for social sensing given the ease of obtaining its data.
This work offers the following three contributions. First, we introduce a new multilingual, multimodal, multi-attribute deep learning system for inferring demographics of users (§2). This pipeline is built from the ground up to enable inference in 32 different languages, addressing the need for such methods beyond English and outperforming state-of-the-art methods on the tasks of predicting the age, gender, and is-organization status of Twitter users. We show this system reduces algorithmic bias with respect to skin tone over the best-performing commercial systems. Second, we formalize a statistical framework for models that debias non-representative samples (§3). Our models explicitly learn per-stratum inclusion probabilities from data, whereas typical post-stratification methods either assume their knowledge or focus on obtaining a post-stratified estimator of a response variable. Third, in real-world evaluations, we show that our debiasing models significantly reduce the error rate in comparison to a model without post-stratification on the inferred demographic attributes (§4). In these ways, we make a significant step towards representative social sensing in downstream applications using multilingual social media.

DEMOGRAPHIC INFERENCE
We propose a new demographic inference model for three attributes: gender, age, and a binary organization indicator ("is-organization"). The first two attributes, gender and age, are widely reported in census data and are core features for accurately measuring demographic biases. The third attribute was selected based on a known confounder for people-based studies of online platforms: the presence of accounts belonging to organizations [2,54], which should be distinguished from individual accounts. Next, we describe the model, its data, and its evaluation.

Classification Task
We consider the task of estimating population count on Twitter given a stream of user tweets. This scenario is motivated by the common research use case for social sensing on Twitter, where a researcher consumes one of the Twitter streams (e.g., the 10% or 1% sample streams) and wants to analyze, from the tweets, a broader phenomenon that extends beyond Twitter. In such a scenario, many users are expected to appear rarely, possibly only once, which hinders the use of existing demographic methods that rely upon large amounts of text to infer gender [13,17,18,25,45,49,72,73,82] and age [33,58,59,63,65,68].
Figure 2: The M3 model for inferring gender, age, and organization-identity from image and text data.
Instead, we avoid making inferences based on tweets and focus on using only the information associated with a user's account, which allows our method to easily scale to large volumes of users without the need for significant quantities of text. Not stratifying users based on tweets has the added advantage of not passing demographic biases on to downstream social sensing tasks that analyze the same language. Ultimately, our method uses four sources of information (username, screen name, biography, and profile image) and is designed to operate effectively on multilingual data.
We formalize the three classification tasks as follows. Gender and is-organization are modeled as binary classification tasks. Age is known to be a difficult inference task in social media [58,82]. Here, we opt to categorize age in four levels: ≤ 18, (18,30), [30,40), [40,99). Our choice is motivated by (i) our ability to align these age ranges with census data reports and surveys [74,75], and (ii) the difficulty of the task (even for humans) when finer-grained age classes are used [23,76], while (iii) still making the resulting classes amenable to downstream tasks. While other attributes could be used in debiasing (e.g., education or income), ground truth population statistics for these attributes are not widely available in censuses, making it difficult to train and evaluate any model.
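As a small illustration of the four-level age scheme, the sketch below maps an integer age to a class. It assumes whole-year ages, so that the classes become 0-18, 19-29, 30-39, and 40 and above; the names `AGE_CLASS_STARTS`, `AGE_CLASS_LABELS`, and `age_to_class` are hypothetical and not part of M3.

```python
from bisect import bisect_right

# Assumption: ages are whole years, so the four classes are
# 0-18, 19-29, 30-39, and 40+. These are the lower bounds of
# the second, third, and fourth class.
AGE_CLASS_STARTS = [19, 30, 40]
AGE_CLASS_LABELS = ["<=18", "19-29", "30-39", ">=40"]


def age_to_class(age: int) -> str:
    """Map a non-negative integer age to one of the four age classes."""
    if age < 0:
        raise ValueError("age must be non-negative")
    # bisect_right counts how many class boundaries are <= age.
    return AGE_CLASS_LABELS[bisect_right(AGE_CLASS_STARTS, age)]
```

Keeping the boundaries in one sorted list makes it easy to realign the classes with a different census bracket scheme.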

Method and Training Procedure
Demographic attributes are often expressed in both image and text. We propose a new multimodal inference model with a novel training technique that leverages both of these sources to significantly augment our predictive accuracy. We first describe the model and each modality and then discuss the training procedure.

2.2.1 Model. The full model comprises two separate pipelines for processing a profile image and each of the three text sources of information, as shown in Figure 2. Here, we use a shared pipeline for all three attributes, using multi-task learning for the output. We refer to this as the M3 model, after its multimodal, multilingual, and multi-attribute abilities. Below, we describe the architecture of each modality and how they are combined.
Image Model. The image classifier was constructed using DenseNet [38], chosen based on initial performance experiments over current state-of-the-art vision models on our data. We scale all profile images to 224x224 to meet the input size of the model.
Text Model. Separate character-based neural models were trained for each text input, and then a multi-input model was tuned based on the weights from each single model. Embeddings for username and biography were limited to the 3000 most frequent characters in our training corpus, with remaining characters represented with an encoding for their Unicode category; screen names are embedded in the ASCII range. We use a 2-stack bidirectional character-level Long Short-Term Memory (LSTM) architecture [36] to capture necessary information. To incorporate potentially different meanings of the same character in different languages, we concatenate a separate trainable language embedding to each character embedding and add a fully connected layer before the result is fed into the LSTM. The fusion of character-based input with language embeddings provides two critical benefits that let us scale to the linguistic diversity seen in global platforms like Twitter. (1) Character-based RNNs are able to capture morphological regularity within a language and substantially reduce the parameter space due to the reduced vocabulary size of the model (i.e., the set of characters) as compared with word-based models with orders-of-magnitude larger vocabularies [15,27,46]. (2) By fusing language embeddings with character embeddings, the joint embedding lets us represent shared linguistic regularity across languages, which lets us better scale to the multilingual environment.
Full Multimodal Model. To construct the full model, separate models are fully trained for image and text for all attributes. Then, the softmax output layer of each is removed and the two models are joined into a new model with a modality drop-out layer (described next), two fully connected layers of size 2048, a Rectified Linear Unit (ReLU) layer [57], and separate output layers for each task. This full model is then fine-tuned using the pretrained single-modality models' parameters for initialization.
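The character vocabulary used by the text model (the most frequent characters plus a fallback to the Unicode category for rare ones) can be sketched with the standard library alone. This is an illustrative sketch under stated assumptions, not the M3 code: `build_char_vocab`, `encode`, the `<pad>`/`<unk>` tokens, and the `<cat:..>` naming are hypothetical, and the language embedding and LSTM layers are deliberately left out.

```python
import unicodedata
from collections import Counter

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def category_token(ch):
    """Fallback token for a rare character: its Unicode general category."""
    return "<cat:" + unicodedata.category(ch) + ">"


def build_char_vocab(texts, max_chars=3000):
    """Index the `max_chars` most frequent characters, then add one slot per
    Unicode category observed among the remaining (rare) characters."""
    counts = Counter(ch for text in texts for ch in text)
    vocab = {PAD: 0, UNK: 1}
    for ch, _ in counts.most_common(max_chars):
        vocab[ch] = len(vocab)
    for ch in counts:
        if ch not in vocab:
            vocab.setdefault(category_token(ch), len(vocab))
    return vocab


def encode(text, vocab):
    """Turn a string into integer ids, backing off to the category slot and
    finally to <unk> for categories never seen while building the vocab."""
    ids = []
    for ch in text:
        if ch in vocab:
            ids.append(vocab[ch])
        else:
            ids.append(vocab.get(category_token(ch), vocab[UNK]))
    return ids
```

In a full model these ids would index a trainable character embedding table; backing off to the category keeps the table small while still telling the model whether an unseen symbol is, say, a lowercase letter or an emoji-like symbol.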

Training Procedure. Rather than training M3 end-to-end initially, we combine three procedures for the training process: co-training, multilingual data augmentation, and modality drop-out.
Co-Training. Both image and text can carry a clear demographic signal. Given the potentially limited training data for speakers of less common languages, we take advantage of the multiple views of the same users by incorporating co-training, a semi-supervised learning technique. Co-training first learns a separate classifier for each view using any labeled examples. The most confident predictions of each classifier on the unlabeled data are then used to iteratively construct additional labeled training data [11]. Ardehaly and Culotta [4] recently showed co-training was effective for demographic classification in social media data using images and text in a graph-based co-training learning procedure.
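The generic co-training loop described above can be sketched in a few lines. This is a minimal illustration rather than the M3 training code: `co_train`, `train_lookup`, the tuple-of-views data layout, the toy lookup classifiers, and the single 0.9 threshold are all assumptions made for the example.

```python
def co_train(labeled, unlabeled, train_fns, threshold=0.9, rounds=3):
    """Generic co-training.

    labeled:   list of (views, label), where views is a tuple (one per view)
    unlabeled: list of views tuples
    train_fns: one trainer per view; each takes [(view_value, label)] and
               returns predict(view_value) -> (label, confidence)
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(rounds):
        if not pool:
            break
        # Train one classifier per view on everything labeled so far.
        models = [
            train([(views[i], y) for views, y in labeled])
            for i, train in enumerate(train_fns)
        ]
        remaining = []
        for views in pool:
            preds = [model(views[i]) for i, model in enumerate(models)]
            label, conf = max(preds, key=lambda p: p[1])
            if conf >= threshold:
                labeled.append((views, label))  # trusted pseudo-label
            else:
                remaining.append(views)
        if len(remaining) == len(pool):
            break  # nothing new was labeled, stop early
        pool = remaining
    return labeled, pool


def train_lookup(pairs):
    """Toy 'classifier': memorises values, fully confident only on seen ones."""
    table = dict(pairs)

    def predict(value):
        if value in table:
            return table[value], 1.0
        return None, 0.0

    return predict
```

For example, with one labeled user and `train_fns=[train_lookup, train_lookup]`, a user who shares an image with the labeled user is pseudo-labeled in the first round, and a user who shares only text with that newly labeled user is picked up in the second round.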
In comparison to the variance of gender and age expressions in textual data across sociolinguistic contexts [6,70], image data provides a more universal depiction, acting as an analog to a universal pivot language in machine translation [80]. Thus, we train the image-based portion of the M3 model first, which identifies high-confidence labeled data from an unlabeled multilingual dataset. Second, the text-based portion of the M3 model is trained using the original dataset plus the text data associated with any high-confidence image-based classifications. Finally, we train the full M3 model with the training dataset and all high-confidence predictions from the unlabeled data. All models are trained as multi-task models, where a common input representation and architecture is used to predict each output task. The probabilities for each task were represented using a top softmax layer.
Multilingual Data Augmentation. The M3 model is designed to operate in online environments on tens of languages. However, few approaches and little labeled data exist for this task for languages other than English, with some exceptions [16,32,62]. Even with co-training, the relative differences in number of speakers between languages (e.g., English vs. Slovenian) create a substantial class imbalance which would prevent the model from effectively learning signals of each attribute in users' biographies written in less frequent languages. Therefore, we perform data augmentation by using automatic machine translation of the training data.
Given the scale of our training data (described later in §2.3), the use of online translation services is prohibitive both in terms of time and cost. Instead, we opt to use word-based translation using the MUSE bilingual dictionaries [20]. For each instance in the training data, all words in the bilingual dictionary are replaced with the translation. A translated instance is only kept if 80% or more of its words could be translated. While word substitution typically produces translations that are less accurate and less fluent, we observed that the remaining biography translations were often of reasonable quality due to the declarative nature of biographies, which often consist primarily of lists of readily translatable noun descriptions or attributes of the user without using full grammatical sentences. Such translations also largely capture topical and semantic regularity, even if the morphology is not precise.
Modality Drop-out. The co-training model uses a conjoining architecture with pre-trained weights from the image model and the multi-input text model. To encourage the model to use each source of information, we introduce a new technique, modality drop-out, that can fully hide a source of information (e.g., the biography) during training. We add an extra input dropout with a probability of 0.25 (i.e., keeping 3 out of 4 input sources on average) to better regularize the model and support use cases where not all 4 input sources are available. When a source is dropped, images are replaced by a random matrix while text is mapped to a specific empty embedding.
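The two steps described above, dictionary-based translation with an 80% coverage filter and modality drop-out with probability 0.25, can be sketched as follows. Both functions are hypothetical illustrations: the whitespace tokenisation, the lowercase dictionary lookup, the `<empty>` placeholder, the list-of-lists image, and the choice to drop each source independently are assumptions, not details confirmed by the paper.

```python
import random

EMPTY_TEXT = "<empty>"  # hypothetical stand-in for the empty text embedding


def translate_by_dictionary(text, dictionary, min_coverage=0.8):
    """Word-by-word translation; returns None unless at least
    `min_coverage` of the words are found in the bilingual dictionary."""
    words = text.split()
    if not words:
        return None
    hits = sum(1 for w in words if w.lower() in dictionary)
    if hits / len(words) < min_coverage:
        return None
    # Untranslatable words are kept unchanged.
    return " ".join(dictionary.get(w.lower(), w) for w in words)


def modality_dropout(inputs, p_drop=0.25, rng=random):
    """Hide each input source independently with probability `p_drop`.

    inputs: dict with an "image" entry (non-empty 2D list of floats) and
    text entries. A dropped image becomes random noise of the same shape;
    a dropped text field becomes EMPTY_TEXT.
    """
    out = {}
    for name, value in inputs.items():
        if rng.random() >= p_drop:
            out[name] = value  # source kept as is
        elif name == "image":
            rows, cols = len(value), len(value[0])
            out[name] = [[rng.random() for _ in range(cols)]
                         for _ in range(rows)]
        else:
            out[name] = EMPTY_TEXT
    return out
```

With four sources and `p_drop=0.25`, three sources survive on average, which matches the "keep 3 out of 4" description; at inference time the same placeholders can stand in for a genuinely missing biography or image.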

Data
Training and evaluating M3 uses five distinct datasets: (1) a large dataset of Twitter profiles whose gender and age are heuristically identified, predominantly in English, (2) a curated dataset of Twitter accounts belonging to organizations, (3) an image dataset of faces from IMDB and Wikipedia, (4) a massive unlabeled set of users, and (5) a crowdsourced dataset for all three attributes that spans all 32 languages in our study. We describe each next.

Heuristically-identified users.
Heuristically-identified users were drawn from a 10% sample of Twitter from 2014 to 2017, where roughly 40% of the users are English speakers [34]. Here, we identify gender and age from fixed expressions in users' biographies that contain a gender- or age-signaling word, e.g., "mother of two wonderful kids" or "26 y/o dude". Age data was further augmented by identifying tweets wishing another user a happy birthday (cf. Al Zamal et al. [1]). These pattern-based approaches are known to be highly precise for certain attributes when properly constructed [8,9]. Patterns were created for five languages: English, Spanish, French, German, and Swedish.
Organization accounts were manually curated by identifying 676 Twitter lists that primarily contained organizations, such as non-profits, local clubs, companies, or municipal services. These lists resulted in 59.92K unique organization accounts. For heuristically-identified users and organization accounts, we collect their current profile image, screen name, username, and biography. User age is adjusted to the present day. For heuristically-identified users, the biography is altered to replace the gender- or age-indicating word with a special token so that the model is forced to look for additional cues to recognize each attribute.
The IMDB-WIKI dataset [66] consists of 523,051 images from headshots of actors in IMDB and profile pictures in Wikipedia. This dataset provides an auxiliary source of information for fine-tuning the image-based part of the M3 model.
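To make the pattern-based labeling and masking of biographies concrete, here is a small sketch. The regular expression, the two-entry gender lexicon, and the `<mask>` token are invented for illustration; the actual patterns used for the five languages are not given in the text.

```python
import re

# Hypothetical patterns; the paper's real pattern set is not reproduced here.
AGE_PATTERN = re.compile(r"\b(\d{1,2})\s*(?:y/o|yo|years? old)\b", re.IGNORECASE)
GENDER_WORDS = {"mother": "female", "dude": "male"}
MASK = "<mask>"  # hypothetical special token


def label_biography(bio):
    """Return (age, gender, masked_bio).

    age and gender are None when no pattern fires. Any signaling
    expression that does fire is replaced by MASK, so that a model
    trained on the masked text must rely on other cues.
    """
    age = gender = None
    masked = bio
    match = AGE_PATTERN.search(bio)
    if match:
        age = int(match.group(1))
        masked = AGE_PATTERN.sub(MASK, masked)
    for word, label in GENDER_WORDS.items():
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        if pattern.search(masked):
            gender = label
            masked = pattern.sub(MASK, masked)
            break
    return age, gender, masked
```

Applied to the two example biographies above, the sketch yields an age of 26 and the label male for the second, and the label female with no age for the first.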
These three datasets constitute our initial data prior to performing the three-step annotation procedure. Table 1 shows the detailed statistics of the data for each step.
Co-training depends on having access to a large set of unlabeled users whose high-confidence labels from one modality (e.g., profile image) can be used to augment the training set for another modality. Here, we collect 36.97M profiles of users who speak one of the 32 languages commonly spoken in Europe and for whom we have no ground truth label. Languages were identified using CLD2 [53]. Note that users in this unlabeled dataset will only become training instances if one view's classifier (text or image) labels them with high confidence.

Crowdsourcing Data. Current gender, age, and organization datasets are nearly all produced for English-speaking users, which prevents us from evaluating the M3 model in the multilingual environment. Therefore, we constructed a new dataset of up to 200 randomly-selected Twitter users for each of the 32 languages in our study. Three annotators on Figure Eight were shown the username, screen name, short biography, and profile image of each Twitter account. We instructed workers to determine the gender and age of a given user from two drop-down menus based only on this information. The selection options regarding gender also included a category for profiles belonging to multiple persons and one for non-person accounts (e.g., organization/bot). Regarding age, annotators were given seven categories to choose from: <= 18, (18,30), [30,40), [40,50), [50,60), [60,70), >= 69. They could select "don't know" for all questions if they felt not enough information was available. We employed Figure Eight's selection mechanism for speakers of the target language of the set to be annotated, or chose workers by their country of residence if one of that country's main spoken languages was identical to the target language of the task. Additionally, each job included a set of 30 test questions, annotated as a gold standard by the authors. These were used to screen for spam but also for any non-speakers of the target language remaining after our first filter; to this end, at least 10 questions were only solvable by sufficiently understanding the target language. Annotation reliability was measured for each Figure Eight job using Krippendorff's α [48]. For the computations of α, we excluded all English test questions from the 31 non-English jobs as well as profiles with only one annotation, and we included "don't know" answers as an answer option.
To establish a measure of how well workers did on the age classification with respect to the age brackets classified by M3, we collapsed age bracket encodings over 40 into one category. Across all languages, the mean α is 0.81 for the gender annotations and 0.57 for the age brackets, hinting at the high difficulty of the Twitter profile age classification task for humans. For a separate measure, we recoded all gender and organization annotations into the two categories "is organization" and "is not organization" and calculated Krippendorff's α based on these data, resulting in a mean of approximately 0.75 across all languages, showing that workers could identify non-personal accounts reasonably well. The final dataset is constructed from all instances where at least two annotators agreed on the label for a particular attribute (majority vote), resulting in 4,732 profiles across 32 languages.
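The label aggregation described above (collapse the brackets over 40, then keep a label only when at least two annotators agree) can be sketched as below. The function name, the string labels, and the decision to discard "don't know" votes before counting are assumptions for the example; the text does not state how such votes were handled in the final dataset.

```python
from collections import Counter

# Collapse the four annotation brackets over 40 into the single M3 class.
# The bracket strings follow the list of annotation options given above.
COLLAPSE = {
    "[40,50)": "[40,99)",
    "[50,60)": "[40,99)",
    "[60,70)": "[40,99)",
    ">= 69": "[40,99)",
}
DONT_KNOW = "don't know"


def majority_label(annotations, min_agree=2):
    """Return the label chosen by at least `min_agree` annotators, else None.

    Assumption: "don't know" votes are dropped before counting.
    """
    votes = Counter(
        COLLAPSE.get(a, a) for a in annotations if a != DONT_KNOW
    )
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_agree else None
```

With three annotators per profile, two agreeing votes always form a unique majority, so no tie-breaking rule is needed.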

Data Partitions.
All self-reported labeled data is partitioned into 80% train, 10% development, and 10% test splits. No official train/test/dev splits are available for the IMDB-WIKI dataset, so we perform our own partition with the same percentages as for the self-reported data. During model training, only the development set for self-reporting users is used for model selection. Owing to its small size, the crowdsourced data was only used for testing purposes. For the organizational data, we create a full dataset with non-organizational accounts by randomly sampling accounts from the heuristically labeled data to attain a 1:9 ratio, following the report in McCorriston et al. [54] that organizations make up 10% of the accounts on Twitter.
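A minimal sketch of the two partitioning steps, an 80/10/10 split and sampling individual accounts to reach a 1:9 organization-to-individual ratio, is given below. The function names, the fixed seed, and the use of integer arithmetic for split sizes are choices made for this example only.

```python
import random


def split_80_10_10(items, seed=0):
    """Shuffle and split into train/dev/test with 80/10/10 proportions."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_dev = len(items) // 10
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # remainder goes to test
    return train, dev, test


def build_org_dataset(org_accounts, individual_pool, ratio=9, seed=0):
    """Pair every organization account with `ratio` randomly sampled
    individual accounts, giving a 1:ratio class balance.
    Returns a list of (account, is_organization) pairs."""
    org_accounts = list(org_accounts)
    individual_pool = list(individual_pool)
    k = ratio * len(org_accounts)
    if k > len(individual_pool):
        raise ValueError("not enough individual accounts to sample from")
    sampled = random.Random(seed).sample(individual_pool, k)
    return ([(a, True) for a in org_accounts]
            + [(a, False) for a in sampled])
```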

Co-Training Setup.
The training data is balanced for each stage, where we oversampled gender and organization status, and undersampled the two most-frequent classes (0-18 and 19-29) while oversampling the other two age classes. High-confidence thresholds were set to 0.9 for gender and organization status in both the image and text modalities. As age is a four-class classification task, its high-confidence threshold is set to 0.7 in the image classifier, based on early examination of the prediction quality. We observed that age is more readily predictable in text and therefore used a threshold of 0.9 to keep quality high while also providing a sufficient number of new accounts for co-training.
Separate image and text models are each trained for 10 epochs. Due to the imbalanced number of instances per attribute (Table 1), each epoch contains 10000 steps where each step contains a minibatch of 128 instances balanced across the three attributes. This setup ensures the model sees instances of the smaller datasets (age and organization). The image model is trained first, and its high-confidence instances are then used to train the text model. The final co-training model was trained for 5 epochs with input dropout and another 3 epochs without input dropout for better convergence. AMSGrad optimizers [64] were used in all training processes, with a learning rate of 0.001 for models trained from scratch and 0.0005 for fine-tuned models. Text models were trained on one GPU, while the image and co-training models were trained in parallel on 4 GPUs. The full pipeline was built in PyTorch [61].
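The per-modality, per-attribute thresholds stated above can be written down as a small lookup table. The selection function around it is only a sketch: `THRESHOLDS`, `confident_labels`, the attribute keys, and the nested-dictionary format for predicted probabilities are assumptions for the example.

```python
# High-confidence thresholds as stated in the text:
# 0.9 everywhere, except 0.7 for age in the image classifier.
THRESHOLDS = {
    ("image", "gender"): 0.9,
    ("image", "is_org"): 0.9,
    ("image", "age"): 0.7,
    ("text", "gender"): 0.9,
    ("text", "is_org"): 0.9,
    ("text", "age"): 0.9,
}


def confident_labels(modality, class_probs):
    """Keep, per attribute, the top predicted label if it clears the
    threshold for this modality.

    class_probs: {attribute: {label: probability}}
    returns:     {attribute: label} for the confident attributes only
    """
    selected = {}
    for attribute, probs in class_probs.items():
        label, p = max(probs.items(), key=lambda kv: kv[1])
        if p >= THRESHOLDS[(modality, attribute)]:
            selected[attribute] = label
    return selected
```

The same age prediction with probability 0.75 would therefore be accepted as a pseudo-label when it comes from the image classifier but rejected when it comes from the text classifier.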

Evaluations
Accurate inference of representative attitudes from social media relies upon accurate stratification approaches. Here, we test the M3 model against current state-of-the-art systems to characterize how inference error may affect demographic bias.

Comparison Systems and Data. The M3 model is compared with state-of-the-art systems from each modality and attribute. For images, we compare with Face++ [42] and Microsoft Face API [5] on age and gender performance. For text, we compare with three current state-of-the-art systems for inferring gender from usernames, genderperformr [78], demographer [47], and Jaech et al. [39], which could feasibly operate on names from multiple languages.
No multilingual organization recognition system exists, so we limit our evaluation to the only publicly-available person-vs-organization dataset, which was scored by the Humanizr [54] and Demographer [79] systems. This data consists of a uniform sample across Twitter accounts, approximately 10% of which were organizations. Here, we recollect the user profiles for their 20,273 accounts, of which 18,587 (91.6%) were still available as of October 2018 and which matched the distribution originally reported. Their model was evaluated on the dataset using cross-validation, whereas we treat the entire data as a test set and report performance.

Gender Recognition. M3 produces state-of-the-art gender recognition for each of the three image-based datasets (IMDB, WIKI, and Twitter), as seen in Figure 3. M3 provides significant improvements in performance (F1) at each comparison system's level of recall. The one exception to this trend is the performance of the MSFT classifier on the WIKI dataset, which has highly similar performance to M3. Given the MSFT model's large jump in performance, we hypothesize that their model may have been trained on Wikipedia data, although this cannot be confirmed. We observed that the increase in coverage of M3 is due in part to the model learning non-facial attributes associated with more than one gender, e.g., certain sports with the male gender, which allow M3 to process profile photos that other models cannot.
On the text-based data, M3 outperforms all systems except one when using just the username and masking all other information, shown in Table 2. When the model is allowed to see all other text information, M3 performance improves substantially, indicating it can effectively fuse several sources of information. While this latter setup uses more information than comparison systems, in practice, the Twitter API includes all text information used by M3 by default; so the only additional step is downloading the profile image.

Age Recognition.
Age recognition is a significantly more difficult task in social media, as seen by the performances in Figure 4. The M3 model offers similar performance to commercial models on the IMDB and WIKI datasets, which primarily feature face-forward headshots in good lighting. However, in real-world Twitter profiles (shown in blue), the M3 model substantially outperforms both in F1 at each of their respective coverage levels, with 0.16 and 0.11 absolute improvements over Face++ and Microsoft, respectively.

Organization Recognition.
For any task that draws inferences about human actors, non-human and specifically organizational accounts are a major source of error. The results of Table 3 show that our model is able to identify these accounts with 16.3% higher accuracy than the next closest system with no drop in performance at recognizing humans. Results on our test set from our organization data (§2.3) in Table 4 (top right) show similar performance, with an overall F1 of 0.898 that indicates high performance for both classes.
Table 2 (fragment): [78] 0.835; Demographer [47] 0.781 †; Jaech and Ostendorf [39] 0.763

Multilingual Performance.
As a full test of M3, we evaluate it against our crowdsourced account labels in 32 languages, shown in Table 4 (top right). Performance in this multilingual setting is on par with the performance on the primarily-English heuristically-labeled data, though age classification remains the most difficult task. Figure 5 shows the performance per language for each attribute. Performance at predicting gender and is-organization is similar for most languages, while age performance varies significantly by language, from 0.28 F1 for Bosnian to 0.73 for Slovenian and Welsh. These results together indicate that M3 is sufficiently accurate in the multilingual European environment.

Ablation Study.
To examine which parts of the system are contributing to performance, we perform an ablation study by (i) restricting the model to one modality and (ii) training a model without using co-training or translation for data augmentation. We test each model on (i) the heuristically-labeled and organization data and (ii) the multilingual data, which is representative of the performance on the data used in our downstream task. The results shown in Table 4 reveal two main trends. First, the model benefits from both modalities. The removal of textual information causes the biggest performance drop. Second, although co-training and translation hurt performance in the heuristically-labeled data, they produced a substantially better model when evaluated on multilingual data. Since the heuristically-labeled data is primarily in English, translation potentially adds noise and forces the model to represent a larger space of inputs that are not represented in the test data. This performance difference indicates that the two data augmentation techniques are highly beneficial when bootstrapping a model from mostly monolingual data.

Test for Algorithmic Bias.
Image-based models for gender recognition are known to suffer from algorithmic bias in recognizing darker-skinned individuals [12]. This bias is thought to be a result of non-representative training data, specifically the underrepresentation of darker skin tones in the training data. As our focus is on European countries, such a bias could be present in the M3 data. However, because our co-training procedure uses a large unlabeled set of users from across the globe, these users can potentially provide a more representative sample and reduce algorithmic disparity. To test for algorithmic bias, we evaluate the M3 model's performance on the Gender Shades dataset [12], which contains 3964 gender-annotated facial images balanced across light and dark skin tones. In this earlier algorithmic audit, the Microsoft (MSFT) and Face++ gender classifiers performed substantially worse on women, especially darker-skinned women. The gender inference performance, shown in Table 5, reveals that M3 has substantially less algorithmic bias relative to the two best-performing commercial systems also tested on Gender Shades. Our model significantly improves performance on darker-skinned women in comparison to the other two systems, which had the largest performance disparity on that demographic. However, M3 is least accurate on darker-skinned males, which indicates that additional work is needed to reach performance parity across skin tones. Nevertheless, this analysis also indicates that M3 is more suitable than existing systems for operating in the global environment and increases the robustness of downstream social sensing applications.

LEARNING INCLUSION PROBABILITIES
In social sensing studies, the results of measurements performed on a given platform, e.g., a social media site, are studied to understand the behavior of a population. Often, such measurements are biased [67], as individuals with certain demographics are more likely to join these platforms, e.g., young people may be more likely to join Twitter. Obtaining representative estimates in these scenarios is challenging, as the probability of an individual with given demographics to be on a given platform, also referred to as inclusion probability in sampling methodology, is typically unknown.
Here, we estimate these inclusion probabilities as a function of demographics. To this end, we learn the debiasing coefficients by building on statistical survey analysis with missing data [10,52,69]. Specifically, we derive estimators for the numbers of individuals by assuming that per-stratum inclusion probabilities are equal for individuals within a group (e.g., a country) and different between groups (e.g., between countries). This assumption allows us to obtain both across-group global debiasing coefficients (i.e., the biases of Twitter users) and group-specific debiasing coefficients (i.e., per-country demographics on Twitter, e.g., developed countries may have more, and more diverse, users on Twitter). Our approach learns the inclusion probabilities, while typical post-stratification methods either assume these probabilities are known or focus on obtaining a post-stratified estimator of a response variable.

Formulation of Debiasing Models
Consider a population $U$ of $N = |U|$ individuals with certain demographics. For simplicity, we focus on the case of only two discrete demographic variables, say age $a$ and gender $g$, which are distributed following $P_N(a, g)$, but the following reasoning applies to other demographics as well. Out of these $N$ individuals, say, $M \le N$ joined a certain online platform with a probability depending on their demographics. For instance, individuals joined Twitter, and younger people were more likely to join than older ones.
In survey analysis, the probability that an individual with certain demographics joined a certain platform corresponds to the inclusion probability of the stratum representing these demographics [10,69]. In the simplest scenario, when the probability of joining the platform is homogeneous in time and across individuals, this probability can be expressed as the ratio of the number of Twitter users with the demographics $a$ and $g$ to all individuals with these demographics, i.e.,
$$\pi(a, g) = \frac{M(a, g)}{N(a, g)} = \frac{M \, P_M(a, g)}{N \, P_N(a, g)},$$
where $P_M(a, g)$ and $P_N(a, g)$ are the distributions of demographics of Twitter users and the overall population, respectively. However, the inclusion probability may vary between individuals. To account for this, we discuss the homogeneity of inclusion probability for a given partition of the population. A partition of the population $U$ is defined through non-overlapping and non-empty subsets $U_i$ of the population that together constitute $U$, i.e., $\bigcup_i U_i = U$. Typically such subsets will have a certain meaning, e.g., a natural partition of a population is a split by countries, regions, or cities. The total number of individuals in the subset $i$ is $N_i$ and the number of individuals having particular demographics is $N_i(a, g)$.
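As a small numerical illustration of the ratio above, the following sketch computes the inclusion probability per stratum in both of its equivalent forms. All counts and stratum labels are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the data used in this work.

```python
# Hypothetical counts per (age group, gender) stratum; the numbers are made up
# purely for illustration and are not taken from the paper.
population = {("18-29", "f"): 1000, ("18-29", "m"): 1000,
              ("30-39", "f"): 1500, ("30-39", "m"): 1500}
platform_users = {("18-29", "f"): 200, ("18-29", "m"): 300,
                  ("30-39", "f"): 150, ("30-39", "m"): 225}


def inclusion_probabilities(users, pop):
    """Return pi(a, g) = M(a, g) / N(a, g) for every stratum."""
    return {stratum: users[stratum] / pop[stratum] for stratum in pop}


def via_distributions(users, pop):
    """Same quantity written as M * P_M(a, g) / (N * P_N(a, g))."""
    m_total = sum(users.values())
    n_total = sum(pop.values())
    return {
        s: (m_total * (users[s] / m_total)) / (n_total * (pop[s] / n_total))
        for s in pop
    }


pi = inclusion_probabilities(platform_users, population)
pi_alt = via_distributions(platform_users, population)
```

In this toy example the younger strata have higher inclusion probabilities than the older ones, which mirrors the kind of platform bias described in the text.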
3.1.1 Homogeneous bias. The inclusion probability, which governs the bias, is homogeneous with respect to a given set of demographics and a given partition of the population, if and only if $\pi_i(a, g) = \pi(a, g)$ for each subset $i$ of the partition, i.e., the inclusion probability does not depend on the elements of the partition.
If the inclusion probability is homogeneous, then, writing $M_i(a, g)$ for the number of platform users in subset $i$ with demographics $a$ and $g$, we can write
$$N_i = \sum_{a, g} N_i(a, g) = \sum_{a, g} \frac{M_i(a, g)}{\pi(a, g)}. \tag{1}$$
Thus, to obtain the inclusion probabilities, we regress $N_i$ against $M_i(a, g)$. If the condition of homogeneity holds, then the regression coefficients are equal to $1/\pi(a, g)$.
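The regression under the homogeneity assumption can be sketched on synthetic data as follows. This is a minimal illustration rather than the released implementation: it assumes NumPy, noise-free counts, four hypothetical strata with made-up inclusion probabilities, and an ordinary least-squares fit without an intercept or random effects.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inclusion probabilities for four strata (age x gender).
true_pi = np.array([0.20, 0.30, 0.10, 0.15])

# Synthetic platform counts M_i(a, g): 50 regions (rows) by 4 strata (columns).
M = rng.integers(50, 500, size=(50, 4)).astype(float)

# Under homogeneity, the total population of region i is
# N_i = sum over strata of M_i(a, g) / pi(a, g).
N = M @ (1.0 / true_pi)

# Regress N_i on M_i(a, g) without an intercept; each coefficient
# estimates 1 / pi(a, g) for the corresponding stratum.
coef, *_ = np.linalg.lstsq(M, N, rcond=None)
estimated_pi = 1.0 / coef
```

Because the synthetic totals are generated exactly from the assumed probabilities, the fit recovers them up to numerical precision. With real, noisy counts the coefficients would only approximate $1/\pi(a, g)$.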

Inhomogeneous bias.
If the inclusion probability is inhomogeneous for given demographics and partition, then we must either model each subset of the partition separately or use a different model that relaxes the homogeneity assumption by specifying the functional form of inhomogeneity. Here, we consider the inhomogeneity of the form
$$\pi_i(a, g) = f_1(a) \, f_2(g) \, M_i(a, g)^{\nu}, \tag{2}$$
where $\nu$ is an unknown exponent and $f_1$ and $f_2$ are unknown functions.⁵ Under this assumption,
$$\log N_i(a, g) = \log \frac{M_i(a, g)}{\pi_i(a, g)} = (1 - \nu) \log M_i(a, g) - \log f_1(a) - \log f_2(g).$$
This time, to obtain the debiasing coefficients and corresponding inclusion probabilities, we regress the ground-truth population, $\log N_i(a, g)$, against the biased measurements, $\log M_i(a, g)$, and the demographic indicator variables, i.e.,
$$\log N_i(a, g) \sim \beta_1 \log M_i(a, g) + \sum_{a'} \beta_{a'} \delta_{a a'} + \sum_{g'} \beta_{g'} \delta_{g g'}, \tag{3}$$
where $\delta_{ij}$ is the Kronecker delta, i.e., $\delta_{ij} = 0$ if $i \neq j$ and $\delta_{ij} = 1$ if $i = j$, $\beta_1 = (1 - \nu)$, $\beta_a = -\log f_1(a)$, and $\beta_g = -\log f_2(g)$. From these regression coefficients we can obtain the inclusion probability $\pi_i(a, g)$ of a set of samples $i$ with the demographics $a$ and $g$ via Equation 2. This debiasing model has been proposed recently by Zagheni et al. to predict the number of migrants based on Facebook data [81], but its derivation and explicit formal interpretation have not been provided until now, to the best of our knowledge. Note that this method of estimating inclusion probabilities requires the ground-truth joint counts $N_i(a, g)$ at the time of training, whereas the model based on the homogeneity assumption requires only the total counts $N_i$. The availability of the joint counts is often limited in practice because of an insufficient number of samples per stratum.
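The log-space regression can be illustrated on synthetic data generated from a power-law inclusion probability of the form described above. This is a sketch under stated assumptions: the exponent, the age and gender effects, and the counts are all made up, the fit is plain least squares with NumPy, and one gender indicator is dropped as a reference level so that the design matrix has full rank (the indicator formulation in the text keeps all levels).

```python
import numpy as np

rng = np.random.default_rng(1)
n_regions, n_ages, n_genders = 40, 4, 2

# Assumed ground truth: pi_i(a, g) = f1(a) * f2(g) * M_i(a, g) ** nu.
nu = 0.3
f1 = np.array([0.05, 0.04, 0.03, 0.02])  # hypothetical age effect
f2 = np.array([0.9, 1.1])                # hypothetical gender effect

# Synthetic platform counts M_i(a, g), shape (regions, ages, genders).
M = rng.integers(20, 2000, size=(n_regions, n_ages, n_genders)).astype(float)
pi = f1[None, :, None] * f2[None, None, :] * M ** nu
N = M / pi  # census counts implied by the assumed inclusion probabilities

# Age and gender index of every flattened (region, age, gender) cell.
a_idx, g_idx = np.meshgrid(np.arange(n_ages), np.arange(n_genders), indexing="ij")
a_flat = np.tile(a_idx.ravel(), n_regions)
g_flat = np.tile(g_idx.ravel(), n_regions)

# Design matrix: log M, all age indicators, gender indicators minus a reference.
X = np.column_stack([
    np.log(M.ravel()),
    np.eye(n_ages)[a_flat],
    np.eye(n_genders)[g_flat][:, 1:],
])
y = np.log(N.ravel())

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
nu_hat = 1.0 - beta[0]                    # since beta_1 = 1 - nu
pi_hat = M.ravel() / np.exp(X @ beta)     # pi = M / N with N predicted by the fit
```

The slope on the log counts identifies the exponent, and the fitted population sizes give back the per-cell inclusion probabilities. With the reference coding used here, the gender coefficient is a contrast against the reference level rather than $-\log f_2(g)$ itself.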

Discussion
We derived debiasing models by making homogeneity or inhomogeneity assumptions. These models predict population size, so they can be evaluated by measuring the error of predictions in cross-validation settings or via model selection methods. In this way, we test which assumption is closer to reality and find the most accurate method for the given dataset. We evaluate these models in the next section using the population of regions in EU countries.

EUROPEAN POPULATION INFERENCE FROM TWITTER DATA
Here, we use the debiasing models formulated in the previous section to obtain the debiasing coefficients and the corresponding inclusion probabilities for Twitter users of different countries. To this end, we regress the country-level ground-truth number of people living in a certain location against the number of Twitter users of different demographics.

Debiasing Models for Population Inference
More specifically, we evaluate five models requiring different amounts of data. The first model is a baseline, the next three models are based on the assumption of homogeneous inclusion probabilities (Equation 1), whereas the last one is based on the inhomogeneity assumption (Equation 3):
$N \sim M$ is our base model that uses only the total population count from the census ($N$) and Twitter ($M$).
$N \sim \sum_g M(g)$ uses gender marginal counts only (i.e., the total counts of males and females not broken down by ages).
$N \sim \sum_a M(a)$ uses age marginal counts only.
$N \sim \sum_{a,g} M(a, g)$ uses the joint histograms inferred from Twitter but only the total population values from the census.
$\log N(a, g) \sim \log M(a, g) + a + g$ uses the joint histograms inferred from Twitter and the joint histograms from the census.
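To make the data requirements of the first four models concrete, the sketch below derives their predictors from a single joint histogram of inferred Twitter counts. The histogram values and the dictionary keys are hypothetical, and NumPy is assumed; the fifth model instead uses the logarithm of each joint cell together with age and gender indicators.

```python
import numpy as np

# Hypothetical inferred Twitter counts M(a, g) for one region:
# rows are four age groups, columns are two genders. Values are made up.
M_joint = np.array([[120, 150],
                    [90, 110],
                    [40, 60],
                    [10, 20]], dtype=float)

# Predictors used by each of the first four models for this region.
features = {
    "N ~ M": np.array([M_joint.sum()]),        # total count only
    "N ~ sum_g M(g)": M_joint.sum(axis=0),     # gender marginals
    "N ~ sum_a M(a)": M_joint.sum(axis=1),     # age marginals
    "N ~ sum_{a,g} M(a,g)": M_joint.ravel(),   # full joint histogram
}
```

Each successive model sees a finer breakdown of the same users, so the number of regression coefficients grows from one to the number of joint strata.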
Note that Twitter users are biased in various ways: the platform is more accessible to tech-savvy individuals and citizens of various countries use it to a different extent [34]. To distinguish the global effect of Twitter on the overall bias from the local effect of country, we use multilevel models. Namely, all slopes and intercepts in the introduced models have random effects specific to a country. Thus, from fixed effects of the model, we obtain global debiasing coefficients (i.e., the biases of Twitter users) and, from random effects, the country-specific debiasing coefficients (i.e., a given country has its own bias towards Twitter).
Note that the homogeneity and inhomogeneity assumptions apply within each country separately. Namely, we group samples by regions of a country, described next. Then, the homogeneity assumption translates into the same inclusion probability for a given demographic across all regions of a country, whereas the inhomogeneity assumption translates into a dependence of the inclusion probability on regions of a country that follows Equation 2.

Data
We retrieved joint population distributions for age and gender on a regional level (NUTS3) for 26 countries of the European Union plus four EFTA members as made available through the CensusHub system of Eurostat, the European Statistical Office.⁶ All census data is from 2011, the most recent year for which comprehensive census data was collectively reported to Eurostat. NUTS3 regions are the finest-grained level of the "Nomenclature of Territorial Units for Statistics" used as a standard for statistical reporting in EU member states [26]. They are based on existing local administrative boundaries and are usually at the level of local districts. Although sizes differ to a certain extent per country, NUTS3 regions provide the most standardized cross-country geographical units to subdivide populations. We use the 2010 iteration of the NUTS3 regions, as these correspond to the subdivisions of the 2011 census, which is the most recent census available.
Our sample of social media users is derived from the random 10% stream of tweets from Twitter. We recorded all users observed from September 2015 to January 2016 and downloaded profile photos for all users in spring 2018. The location of each user is inferred using the method of Compton et al. [19]. This model predicts a user's latitude and longitude and was shown to be least biased with respect to urban and rural areas [41], which is important given the diversity in population centers in our study. The model was trained on a social network of 781M edges constructed from Twitter data spanning 2012 to 2017. Five-fold cross validation reports a median inference error of 7.9km, which is sufficiently accurate for the geographic granularity we use here. Each user is assigned to a NUTS3 region in Europe if the inferred latitude-longitude pair of that user lies within the boundaries of the NUTS3 regions; users not in these regions are discarded. Ultimately, we obtain a dataset of 3,202,964 users within the NUTS3 regions for our study. In the remainder, we will refer to NUTS3 regions simply as "regions." The age, gender, and is-organization variables for each user in our dataset are inferred using M3. As an additional experiment to quantify the impact of including non-human accounts, we ignore any organization classification and treat all accounts as humans, grouping them according to their inferred demographics.
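The assignment of an inferred coordinate pair to a region can be illustrated with a standard ray-casting point-in-polygon test. The source does not state which geometric procedure was used, so this is only an assumed stand-in: it treats longitude and latitude as planar coordinates, and the two rectangular regions and their identifiers are made up, whereas real NUTS3 boundaries are complex polygons.

```python
def point_in_polygon(lon, lat, polygon):
    """Ray-casting test; polygon is a list of (lon, lat) vertices."""
    inside = False
    n = len(polygon)
    for k in range(n):
        x1, y1 = polygon[k]
        x2, y2 = polygon[(k + 1) % n]
        # Does a horizontal ray starting at the point cross this edge?
        if (y1 > lat) != (y2 > lat):
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside
    return inside


def assign_region(lon, lat, regions):
    """Return the first region id whose polygon contains the point, else None."""
    for region_id, polygon in regions.items():
        if point_in_polygon(lon, lat, polygon):
            return region_id
    return None  # the user falls outside all regions and would be discarded


# Two made-up rectangular regions used only for illustration.
toy_regions = {
    "R1": [(0.0, 0.0), (2.0, 0.0), (2.0, 1.0), (0.0, 1.0)],
    "R2": [(2.0, 0.0), (4.0, 0.0), (4.0, 1.0), (2.0, 1.0)],
}
```

A user whose inferred location lies in neither rectangle receives no region, which corresponds to the discarding step described above.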

Results
We evaluate the debiasing models in the following cross-validation settings: leave one region out, leave one country out (i.e., leave out all regions from a given country), and leave one stratum out (e.g., leave out only males aged 30-39). Per case, we measure the mean absolute percentage error of the population estimates for the left-out samples, i.e.,
$$\mathrm{MAPE}(N) = \frac{100\%}{n} \sum_{i} \frac{|\hat{N}_i - N_i|}{N_i},$$
where $\hat{N}_i$ and $N_i$ are the predicted and actual population sizes, respectively, $n$ is the number of regions, and the sum is over all regions. Note that the model $\log N(a, g) \sim \log M(a, g) + a + g$ operates naturally in the log space; hence, before calculating the error of this model we first exponentiate the predicted population sizes to make the results more comparable across all models. Its results should further be compared to the remaining models with care, as the effects of moving to log space cannot be clearly untangled from gains due to learning on joint age and gender attributes.
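The error measure can be written as a short function. The sketch below uses only the Python standard library; the function name and the example values are illustrative. For a model fitted in log space, predictions are exponentiated before the error is computed, as described above.

```python
import math


def mape(predicted, actual):
    """Mean absolute percentage error, in percent, over paired sequences."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - a) / a for p, a in zip(predicted, actual))
    return 100.0 * total / len(actual)


# Made-up population sizes for three held-out regions.
actual = [100.0, 200.0, 40.0]
predicted = [110.0, 180.0, 50.0]
error = mape(predicted, actual)

# Predictions from a log-space model are exponentiated first.
log_predictions = [math.log(p) for p in predicted]
error_from_log = mape([math.exp(v) for v in log_predictions], actual)
```

Here the relative errors are 10%, 10%, and 25%, so the mean is 15%.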
The results of the leave-one-region-out evaluation show a clear benefit to debiasing based on the inferred demographics (Figure 6). For the $N \sim M$ model without demographics, MAPE is 88%. The inclusion of the inferred gender or age demographics in the debiasing models, via $N \sim M(a)$ and $N \sim M(g)$, decreases MAPE to 59% and 61%, respectively. Including inferred joint distributions, e.g., the number of males of age 30-39, in the model $N \sim M(a, g)$, decreases MAPE further to 54%, even though we do not use joint-distribution data from the census for training. In fact, the use of census joint distributions at the stage of model training along with a move to log space, via $\log N(a, g) \sim \log M(a, g) + a + g$, further improves the accuracy, bringing the MAPE down to 33%. To understand the effect of accounts belonging to organizations on this population prediction task, we compare the results of the debiasing models with and without removing the organizations (Figure 6). Removing organizations results in a small reduction in error for all but the model trained with only marginal gender counts, though these differences are not statistically significant. A further analysis shows that the presence of organizations is heavily skewed towards populous metropolitan cities and most regions have very few organizations (i.e., organizations are not equally geographically distributed). However, for populous regions, organizations can create significant error.
To gain further insights into the introduced models, we show scatter plots of true and predicted population sizes for each model (Figure 7). The model with joint Twitter and census counts is noticeably closer to the y = x line, likely because this model is trained in log space. For this model, we also depict the geographical variation of MAPE across the EU regions (Figure 8).
The above results test predictions when one region is left out and its population is predicted. However, in different circumstances, more or less census information could be available. We test two other cross-validation settings: leave one stratum out (e.g., leave out only females aged 30-39) and leave one country out (i.e., leave out all regions from a given country). The latter reflects the generalizability of the model to completely unseen countries where platform adoption probabilities and country-specific biases are not known. The evaluation results for all three cross-validation settings are plotted in Figure 9 for the most accurate model, i.e., $\log N(a, g) \sim \log M(a, g) + a + g$. From the first additional cross-validation setting, we learn that hiding the population sizes for a specific stratum in all regions results in a minimal penalty to the prediction accuracy. For the second additional cross-validation setting, we see that the error rate nearly doubles with an average MAPE of 81%. This result suggests that knowledge of country-specific platform biases is important for accurate estimates and that at least some regions within the country should be seen during a model's training time to reach high performance.
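The three hold-out schemes differ only in the key used to group the training cells, which the following sketch makes explicit. It uses only the Python standard library; the row layout, country codes, and region identifiers are hypothetical, and model fitting is omitted.

```python
def leave_one_group_out(rows, key):
    """Yield (held_out_value, train_rows, test_rows) for each distinct key value."""
    values = sorted({key(r) for r in rows})
    for value in values:
        train = [r for r in rows if key(r) != value]
        test = [r for r in rows if key(r) == value]
        yield value, train, test


# Each row is one (region, stratum) cell; ids and labels are made up.
rows = [
    {"country": c, "region": r, "age": a, "gender": g}
    for c, r in [("AA", "AA001"), ("AA", "AA002"), ("BB", "BB001")]
    for a in ("18-29", "30-39")
    for g in ("f", "m")
]

by_region = list(leave_one_group_out(rows, key=lambda r: r["region"]))
by_country = list(leave_one_group_out(rows, key=lambda r: r["country"]))
by_stratum = list(leave_one_group_out(rows, key=lambda r: (r["age"], r["gender"])))
```

Leaving out a country removes every region of that country from training, so no country-specific information is available for the held-out cells, whereas leaving out a stratum still leaves the other strata of every region in the training set.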

Discussion of Debiasing Results and Potential Sources of Error
Debiasing social media data samples is a difficult task, but our results show that automatic post-stratification with respect to inferred age and gender notably improves population estimation results. Furthermore, our predictive models are fully interpretable, which allows us to estimate the inclusion probabilities and share them with the research community for future reuse. Even when joint distributions are not given by the census, inferring joint distributions from social media data with the model $N \sim M(a, g)$ provides a significant increase in accuracy in population prediction tasks compared to the baseline without debiasing. However, the model $\log N(a, g) \sim \log M(a, g) + a + g$ is notably more accurate, suggesting that the assumption of homogeneity does not hold for the partition of citizens of a particular country, gender, and age into regions. Note, however, that the predictions of the most accurate model are not perfect. This may be caused by the time mismatch of about 5 years between the Twitter and NUTS3 datasets. On the other hand, this point opens the door for developing debiasing models based on other inhomogeneity assumptions and searching for better partitions, e.g., our multilevel model could have more levels to capture biases shared by smaller regions within a country.

Figure 9: Comparisons of the model with joint Twitter and census distributions as increasing amounts of information are held out from training: (i) leaving out one stratum, (ii) leaving out all the strata from one region, and (iii) leaving out all regions in one country. (Panels: Leave One Country Out, Leave One Region Out, Leave One Stratum Out; axis: MAPE, lower is better.)

To learn more about potential confounders, we compare the MAPE of the model $\log N(a, g) \sim \log M(a, g) + a + g$ to three variables: the area of the region, its population density, and average income (Figure 10). NUTS3 regions are defined by each country and vary considerably in land area, from city-states in Germany to large regions in Sweden (up to 98,000 km²). We find no correlation between land area and MAPE. Hecht and Stephens [35] found a clear urban-rural bias in social media data: there tend to be more users, more information, and higher quality information per capita within metropolitan areas. Despite this bias, we find that our estimates are not heavily correlated with population density, suggesting that our models do well at debiasing these differences in inclusion probabilities. Only two countries (Czech Republic and Norway) show a correlation that is statistically different from zero across our models. Finally, income has been suggested as an important explanatory variable for who uses Internet-based platforms and hence could be a potential confound to our estimates [14,40]. However, we similarly find that income and MAPE are not correlated significantly in our models.

Figure 10: Comparing the MAPE of the model with joint Twitter and census distributions to region area (km²), population density (people/km²), and income (USD per capita) shows that the model performance is not biased towards particular types of regions. Colors represent countries as in Figure 7.

CONCLUSION
The everyday opinions expressed in social media provide promising opportunities for measuring population-level statistics for health metrics, political outcomes, or general attitudes. However, social media is a non-representative sample of the population due to demographic skew in usage frequencies and access rates. As such, any direct estimate from a platform is likely biased towards certain demographics. This work provides a holistic solution to this problem by developing a novel method for assigning users to demographic strata and exploiting it in a regression framework for debiasing that allows direct estimation of the probability of an individual with given demographics to be on the given social media platform. Our work provides three main contributions. First, we introduce a state-of-the-art neural system for multi-attribute classification in 32 languages. This contribution also includes the system release and the creation of a new dataset of gender, age, and is-organization annotations in 32 languages. Second, we derive a series of models to debias social media measurements and provide their explicit formal interpretations. Third, in a massive study of all of Europe, we show that our two methods are able to infer regional population counts accurately and provide demographic corrections for all downstream measurements on the grounds of the estimated inclusion probabilities. These results pave the way for more accurate social sensing by laying a foundation of representative population sampling in social media. Code, software, and debiasing coefficients pertaining to this work are released for public use at https://github.com/euagendas/.