Aim: Identify the ancestry of contemporary Brazilians;
Strategy: Surnames as a source of information;
Methods: Historical data on immigrants; Fuzzy matching; Machine learning.
Motivation
Data
Methods
Results classification
Exploratory analysis
Next steps
Motivation
No information on ancestry in Brazil (except for 5 categories of color/race).
Potential applications: Labor market discrimination; Diversity and productivity; Effects of immigration; Epidemiology; (Suggestions?)
Literature
Many studies on language identification.
Limited literature on surname recognition: Mateos (2007) for England; Konstantopoulos (2007) for soccer players’ surnames; Florou & Konstantopoulos (2011) for Nordic surnames; Komahan & Reidpath (2014) for Malaysia.
There is no study for Brazil.
Names in Brazil
Most people have 2 surnames (mother’s + father’s);
Traditionally, married women replace their mother’s surname with their husband’s;
First names (often compound) do not provide much information on ancestry: "I never managed to figure out the Brazilian names. They defy all onomastic dictionaries, and exist only in Brazil." Umberto Eco in Foucault’s Pendulum.
Immigrant arrivals
Source: Levy (1974)
Non-Iberian immigration to Brazil
From 1872 to 1972, 3.5 million non-Iberian immigrants arrived in Brazil, most before 1929. 54% Italians; 8.3% Japanese; 7.5% Germans; 29.6% Others.
Data surname/ancestry
Museu da Imigração de SP (186,000 observations); US Census microdata, 1880/1910 (4.6 million obs.); Japanese Wikipedia (1,500 obs.); Other online sources on specific regions: Veneto and Pomerania. Total of 80,000 unique pairs of surname and ancestry.
Data source contemporary surnames
RAIS 2013: 48 million workers; 81 million surnames;
640,000 unique surnames (including typos).
Overview 1
Fuzzy matching
String matching: Optimal String Alignment (OSA) with distance d. Two strings match if at most d edits (insertion, deletion, substitution, or transposition of adjacent characters) make them identical.
Example with d=1: Müller = Miller = Mueller, but Müller ≠ Miler.
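The matching rule can be illustrated with a minimal sketch of the Optimal String Alignment distance. The function names `osa_distance` and `fuzzy_match` and the lower-casing step are assumptions made for illustration, not the implementation used in the study. The sketch compares characters as-is and does no transliteration (e.g. ü to ue), so it reproduces only the Miller and Miler cases from the example.

```python
def osa_distance(a: str, b: str) -> int:
    """Optimal String Alignment distance: insertions, deletions,
    substitutions, and transpositions of adjacent characters."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or exact match)
            )
            # transposition of two adjacent characters
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(a: str, b: str, max_distance: int = 1) -> bool:
    """Hypothetical helper: match if the OSA distance is at most max_distance."""
    return osa_distance(a.lower(), b.lower()) <= max_distance


print(fuzzy_match("Müller", "Miller"))  # one substitution
print(fuzzy_match("Müller", "Miler"))   # one substitution plus one deletion
```

A quadratic-time table like this is fine for a sketch; matching hundreds of thousands of surnames would likely call for an optimized library.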
Machine Learning
Two algorithms:
Cavnar and Trenkle (1994); Naive Bayes.
In common: both use n-grams; n-grams are treated as independent; the position of each n-gram in the word does not matter.
N-grams
Break the surnames into n-grams.
Example: "LIMA" has six 3-grams: __l, _li, lim, ima, ma_, a__
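A short sketch of this decomposition, assuming the surname is lower-cased and padded with n-1 underscores on each side, which is what the LIMA example suggests. The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(surname: str, n: int = 3, pad: str = "_") -> list:
    """Split a surname into overlapping character n-grams,
    padding both ends with n-1 pad characters."""
    padded = pad * (n - 1) + surname.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("LIMA"))
# ['__l', '_li', 'lim', 'ima', 'ma_', 'a__']
```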
Naive Bayes
Steps:
1. Start with a prior based on the known distribution of ancestry (1920 Census);
2. Use the training dataset to calculate the probability of each n-gram conditional on ancestry;
3. For each "new" surname, use Bayes’ rule to update the probability of ancestry conditional on the surname;
4. Choose the ancestry with the highest probability.
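The four steps above can be sketched as follows. Everything specific here is an assumption for illustration: the toy training pairs, the uniform prior (the slides use the 1920 Census distribution instead), the add-one smoothing, and the names `train` and `classify`. It is not the calibrated model behind the reported results.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(surname, n=3, pad="_"):
    padded = pad * (n - 1) + surname.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(pairs):
    """Step 2: count n-grams per ancestry in the training data."""
    counts = defaultdict(Counter)
    for surname, ancestry in pairs:
        counts[ancestry].update(char_ngrams(surname))
    vocab = {g for c in counts.values() for g in c}
    return counts, vocab


def classify(surname, counts, vocab, prior, alpha=1.0):
    """Steps 3-4: apply Bayes' rule in log space and return the
    ancestry with the highest posterior probability."""
    scores = {}
    for ancestry, c in counts.items():
        total = sum(c.values())
        log_p = math.log(prior[ancestry])  # step 1: prior
        for g in char_ngrams(surname):
            # Add-alpha smoothing (an assumption); the extra +1 in the
            # denominator reserves mass for n-grams never seen in training.
            log_p += math.log((c[g] + alpha) / (total + alpha * (len(vocab) + 1)))
        scores[ancestry] = log_p
    return max(scores, key=scores.get)


# Hypothetical toy data, not taken from the study's sources.
TOY_PAIRS = [
    ("rossi", "ITA"), ("bianchi", "ITA"), ("ferrari", "ITA"),
    ("tanaka", "JPN"), ("yamamoto", "JPN"), ("nakamura", "JPN"),
]
TOY_PRIOR = {"ITA": 0.5, "JPN": 0.5}  # placeholder for the census-based prior

counts, vocab = train(TOY_PAIRS)
print(classify("Rossini", counts, vocab, TOY_PRIOR))
print(classify("Takana", counts, vocab, TOY_PRIOR))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied.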
Cavnar & Trenkle
1. Create a profile of each ancestry: rank n-grams by frequency in the training data;
2. Compare the n-grams of each new surname with the profile of each ancestry;
3. Choose the ancestry whose reference ranking is closest.
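A minimal sketch of these three steps, using the "out-of-place" rank distance from Cavnar and Trenkle's text categorization method. The toy surnames, the profile size `TOP_K`, the fixed penalty for n-grams missing from a profile, and the function names are assumptions for illustration only.

```python
from collections import Counter

TOP_K = 300  # assumed profile size; also used as the penalty for unseen n-grams


def char_ngrams(surname, n=3, pad="_"):
    padded = pad * (n - 1) + surname.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(surnames, top_k=TOP_K):
    """Step 1: rank n-grams by frequency; returns {n-gram: rank}."""
    counts = Counter(g for s in surnames for g in char_ngrams(s))
    return {g: rank for rank, (g, _) in enumerate(counts.most_common(top_k))}


def out_of_place(surname, profile, penalty=TOP_K):
    """Step 2: sum of rank differences between the surname's n-grams
    and the ancestry profile; missing n-grams get a fixed penalty."""
    ranked = [g for g, _ in Counter(char_ngrams(surname)).most_common()]
    return sum(
        abs(rank - profile[g]) if g in profile else penalty
        for rank, g in enumerate(ranked)
    )


def classify(surname, profiles):
    """Step 3: choose the ancestry with the smallest distance."""
    return min(profiles, key=lambda a: out_of_place(surname, profiles[a]))


# Hypothetical toy training data, not taken from the study's sources.
profiles = {
    "ITA": build_profile(["rossi", "bianchi", "ferrari"]),
    "JPN": build_profile(["tanaka", "yamamoto", "nakamura"]),
}
print(classify("Rossini", profiles))
print(classify("Takana", profiles))
```

Using the same penalty for every profile keeps ancestries with small training sets from being favored simply because their profiles are shorter.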
Overview 2
Results: Fuzzy matching
378,000 unique surnames fuzzy matched; 81 million contemporary surnames:
Ancestry   IBR     ITA    GER    EAS    JPN    unmatched
Share      84.5%   7.5%   2.7%   0.5%   0.4%   4.4%
Apply machine learning to the 4.4% of non-matched surnames.
Results: Naive Bayes
Table: Predicted × Actual surname (rows: predicted ancestry; columns: actual ancestry)
Predicted   EAS    GER    IBR    ITA    JPN
EAS         0.30   0.01   0.00   0.00   0.01
GER         0.42   0.84   0.24   0.18   0.15
IBR         0.02   0.01   0.15   0.04   0.00
ITA         0.24   0.13   0.59   0.78   0.13
JPN         0.02   0.00   0.01   0.00   0.70
Problems: Too slow; Accuracy = 69%.
Results: Cavnar and Trenkle
Table: Predicted × Actual surname (rows: predicted ancestry; columns: actual ancestry)
Predicted   EAS    GER    IBR    ITA    JPN
EAS         0.66   0.08   0.04   0.02   0.01
GER         0.15   0.84   0.04   0.03   0.00
IBR         0.10   0.04   0.70   0.16   0.02
ITA         0.07   0.03   0.18   0.78   0.02
JPN         0.03   0.01   0.03   0.02   0.95
Accuracy = 79%.
Results
Table: Predicted ancestry
Ancestry IBR ITA GER EAS JPN
Share total 82% 11% 4% 1.3% 0.7%
Maps
Spatial distribution
Scatterplots
Wage by ancestry
Municipalities: share immigrant ancestry X income
Municipalities: share immigrant ancestry X inequality
Municipalities: share immigrant ancestry X poverty
Note: Specification includes controls for race/color, age, education, disability, gender and state. Reason: Discrimination in favor of foreign surnames? Non-observables (quality of schooling)?
Next steps
Add new datasets to the ancestry data (especially for Brazilians and Japanese); Calibrate Naive Bayes to improve accuracy; Support Vector Machine; Apply the algorithm to street names (proxy for local historical impact of immigration); Improve Mincerian regression; ?
EXTRA SLIDES
Historical ancestry and surnames data
Race/color demographics in contemporary Brazil
Afro-Brazilian (8%); "Mixed" Brazilian (43%); "White" Brazilian (48%); East Asian Brazilian (1% in 2010, but 0.45% in 2000); Native Brazilian (0.5%).