Surnames UCLA.pdf

Motivation

Data

Methods

Results classification

Exploratory analysis

Ancestry in Brazil: a surname-based method
Leonardo Monasterio
Ipea/UCB
Brazil Research Seminar / UCLA
March 1, 2016

Next steps

Overview

1 Motivation
2 Data
3 Methods: Fuzzy matching; Machine Learning
4 Results classification: Fuzzy matching; CT results; Maps
5 Exploratory analysis: Scatterplots; Regression analysis
6 Next steps

Overview

Aim: Identify the ancestry of contemporary Brazilians;
Strategy: Surnames as a source of information;
Methods: Historical data on immigrants; Fuzzy matching; Machine learning.

Motivation

No information on ancestry in Brazil (except for 5 categories of color/race).
Potential applications: Labor market discrimination; Diversity and productivity; Effects of immigration; Epidemiology; (Suggestions?)

Literature

Many studies on language identification.
Limited literature on surname recognition:
Mateos (2007) for England;
Konstantopoulos (2007) for soccer players’ surnames;
Florou & Konstantopoulos (2011) for Nordic surnames;
Komahan & Reidpath (2014) for Malaysia.

There is no study for Brazil.

Names in Brazil

Most people have 2 surnames (mother + father);
Traditionally, married women replace the mother’s surname with the husband’s;
First names (often compound) do not provide much information on ancestry:
"I never managed to figure out the Brazilian names. They defy all onomastic dictionaries, and exist only in Brazil." Umberto Eco in Foucault’s Pendulum.

Immigrant arrivals

Source: Levy (1974)

Non-Iberian immigration to Brazil

From 1872 to 1972, 3.5 million non-Iberian immigrants arrived in Brazil, most before 1929:
54% Italians; 8.3% Japanese; 7.5% Germans; 29.6% Others.

Data surname/ancestry

Museu da imigração de SP (186,000 observations);
US Censuses microdata 1880/1910 (4.6 million obs.);
Japanese Wikipedia (1,500 obs.);
Other online sources on specific regions: Veneto and Pomerania.
Total of 80,000 unique pairs of surname and ancestry.

Data source contemporary surnames

RAIS 2013: 48 million workers; 81 million surnames;

640,000 unique surnames (including typos).

Fuzzy matching

String matching
Optimal String Alignment (OSA) distance d: two strings match if no more than d single-character edits (insertions, deletions, substitutions, or transpositions of adjacent characters) are needed to make them identical.
Example with d = 1: Müller matches Miller and Mueller, but Müller ≠ Miler.
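The matching rule above can be sketched in a few lines of Python. This is only an illustration, not the code used in the study: the function names are hypothetical, and stripping diacritics before comparing (so that "Müller" is compared as "muller") is an assumption made here so that the slide's d = 1 example holds. Without that step, Müller to Mueller would take two edits.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks, e.g. 'Müller' -> 'Muller' (assumed preprocessing)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def osa_distance(a, b):
    """Optimal String Alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_match(a, b, max_d=1):
    """True if the normalized surnames are within max_d OSA edits."""
    a_norm = strip_accents(a).lower()
    b_norm = strip_accents(b).lower()
    return osa_distance(a_norm, b_norm) <= max_d
```

With these definitions, `fuzzy_match("Müller", "Miller")` and `fuzzy_match("Müller", "Mueller")` hold at `max_d=1`, while `fuzzy_match("Müller", "Miler")` does not.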

Machine Learning

Two algorithms:
Cavnar and Trenkle (1994);
Naive Bayes.
In common: both use ngrams; ngrams are treated as independent; the position of each ngram in the word does not matter.

N-grams

Break the surnames into ngrams.
Example: "LIMA" has six 3grams: __l, _li, lim, ima, ma_, a__
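A minimal sketch of this decomposition, assuming lowercase conversion and n - 1 underscores of padding on each side (which is what the "LIMA" example implies). The function name `char_ngrams` is hypothetical.

```python
def char_ngrams(surname, n=3, pad="_"):
    """Return the padded character n-grams of a surname, in order."""
    padded = pad * (n - 1) + surname.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


print(char_ngrams("LIMA"))
# ['__l', '_li', 'lim', 'ima', 'ma_', 'a__']
```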

Naive Bayes

Steps:
1 Start with a prior based on the known distribution of ancestry (Censo 1920);
2 Use the training dataset to calculate the probability of each ngram conditional on ancestry;
3 For each "new" surname, use Bayes' rule to update the probability of ancestry conditional on the surname;
4 Choose the ancestry with the highest probability.
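The four steps can be sketched as below. This is an illustrative toy, not the study's implementation: the training pairs and the prior are made-up placeholders (the real prior would come from the 1920 census and the real training data from the surname/ancestry sources), the function names are hypothetical, and add-one (Laplace) smoothing is an assumption added here so that unseen ngrams do not zero out a class.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(surname, n=3, pad="_"):
    padded = pad * (n - 1) + surname.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(pairs, n=3):
    """Step 2: count ngrams per ancestry in the training data."""
    counts = defaultdict(Counter)
    vocab = set()
    for surname, ancestry in pairs:
        grams = char_ngrams(surname, n)
        counts[ancestry].update(grams)
        vocab.update(grams)
    return counts, vocab


def posterior(surname, counts, vocab, prior, n=3, alpha=1.0):
    """Step 3: P(ancestry | surname) via Bayes' rule, ngrams independent."""
    log_scores = {}
    for ancestry, p in prior.items():
        total = sum(counts[ancestry].values())
        denom = total + alpha * (len(vocab) + 1)  # +1 slot for unseen ngrams
        score = math.log(p)                       # step 1: the prior
        for gram in char_ngrams(surname, n):
            score += math.log((counts[ancestry][gram] + alpha) / denom)
        log_scores[ancestry] = score
    top = max(log_scores.values())
    weights = {a: math.exp(s - top) for a, s in log_scores.items()}
    norm = sum(weights.values())
    return {a: w / norm for a, w in weights.items()}


def classify(surname, counts, vocab, prior):
    """Step 4: pick the ancestry with the highest posterior."""
    post = posterior(surname, counts, vocab, prior)
    return max(post, key=post.get)


# Hypothetical toy data and prior, for illustration only.
TOY_PAIRS = [
    ("silva", "IBR"), ("souza", "IBR"), ("oliveira", "IBR"),
    ("rossi", "ITA"), ("bianchi", "ITA"), ("ferrari", "ITA"),
    ("schmidt", "GER"), ("schneider", "GER"),
    ("tanaka", "JPN"), ("suzuki", "JPN"),
]
TOY_PRIOR = {"IBR": 0.70, "ITA": 0.15, "GER": 0.10, "JPN": 0.05}

counts, vocab = train(TOY_PAIRS)
print(classify("schmitz", counts, vocab, TOY_PRIOR))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, which matters more for long surnames than for this toy.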

Cavnar & Trenkle

1 Create a profile of each ancestry: rank ngrams by frequency in the training data;
2 Compare the ngrams of each new surname with the profile of each ancestry;
3 Choose the ancestry whose profile is closest to the surname's ngram ranking.
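The rank-profile idea can be sketched with the "out-of-place" distance from Cavnar and Trenkle's paper. This is a simplified illustration under stated assumptions: it uses only 3grams (the original method mixes ngram lengths), the training surnames are made-up placeholders, and the function names and the `top_k` cutoff are hypothetical.

```python
from collections import Counter


def char_ngrams(surname, n=3, pad="_"):
    padded = pad * (n - 1) + surname.lower() + pad * (n - 1)
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def build_profile(surnames, n=3, top_k=300):
    """Step 1: map each of the top_k most frequent ngrams to its rank."""
    counts = Counter(g for s in surnames for g in char_ngrams(s, n))
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(surname, profile, n=3, top_k=300):
    """Step 2: sum of rank differences; missing ngrams get the max penalty."""
    ranked = [g for g, _ in Counter(char_ngrams(surname, n)).most_common()]
    return sum(
        abs(rank - profile[gram]) if gram in profile else top_k
        for rank, gram in enumerate(ranked)
    )


def classify_ct(surname, profiles):
    """Step 3: choose the ancestry with the smallest out-of-place distance."""
    return min(profiles, key=lambda a: out_of_place(surname, profiles[a]))


# Hypothetical toy training data, for illustration only.
TOY_TRAINING = {
    "IBR": ["silva", "souza", "oliveira"],
    "ITA": ["rossi", "bianchi", "ferrari"],
    "GER": ["schmidt", "schneider"],
    "JPN": ["tanaka", "suzuki"],
}
profiles = {a: build_profile(names) for a, names in TOY_TRAINING.items()}
print(classify_ct("tanabe", profiles))
```

Unlike Naive Bayes, this approach needs no prior and no probability estimates, only rank comparisons.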

Fuzzy matching

Results

378,000 unique surnames fuzzy matched 81 million contemporary surnames;

Ancestry    Share
IBR         84.5%
ITA          7.5%
GER          2.7%
EAS          0.5%
JPN          0.4%
unmatched    4.4%

Apply machine learning to the 4.4% of non-matched surnames.

Results: Naive Bayes

Table: Predicted X Actual surname (rows: actual ancestry; columns: predicted ancestry)

Actual   EAS    GER    IBR    ITA    JPN
EAS      0.30   0.42   0.02   0.24   0.02
GER      0.01   0.84   0.01   0.13   0.00
IBR      0.00   0.24   0.15   0.59   0.01
ITA      0.00   0.18   0.04   0.78   0.00
JPN      0.01   0.15   0.00   0.13   0.70

Problems: Too slow; Accuracy = 69%.

Results: Cavnar and Trenkle

Table: Predicted X Actual surname (rows: actual ancestry; columns: predicted ancestry)

Actual   EAS    GER    IBR    ITA    JPN
EAS      0.66   0.15   0.10   0.07   0.03
GER      0.08   0.84   0.04   0.03   0.01
IBR      0.04   0.04   0.70   0.18   0.03
ITA      0.02   0.03   0.16   0.78   0.02
JPN      0.01   0.00   0.02   0.02   0.95

Accuracy = 79%.

Results

Table: Predicted ancestry

Ancestry   Share of total
IBR        82%
ITA        11%
GER         4%
EAS         1.3%
JPN         0.7%

Maps

Spatial distribution

Scatterplots

Wage by ancestry

Municipalities: share immigrant ancestry X income

Municipalities: share immigrant ancestry X inequality

Municipalities: share immigrant ancestry X poverty

Regression analysis

Table: Mincer regression (dependent variable: log(w))

Brazilian-Foreign surnames   0.042*** (0.001)
Foreign-Foreign surnames     0.119*** (0.002)

Note: Specification includes controls for race/color, age, education, disability, gender and state.
Reason: Discrimination in favor of foreign surnames? Non-observables (quality of schooling)?
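Because the dependent variable is a log wage, each coefficient b translates into an approximate proportional wage difference of exp(b) - 1 relative to the omitted group. The short calculation below illustrates this. It assumes the omitted group is workers with two Brazilian surnames, which the slide does not state explicitly, and the function name is hypothetical.

```python
import math


def wage_premium(coef):
    """Proportional wage difference implied by a log-wage coefficient."""
    return math.exp(coef) - 1


for label, coef in [("Brazilian-Foreign", 0.042), ("Foreign-Foreign", 0.119)]:
    print(f"{label}: {wage_premium(coef) * 100:.1f}%")
# Brazilian-Foreign: 4.3%
# Foreign-Foreign: 12.6%
```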

Next steps

Add new datasets to ancestry data (especially for Brazilians and Japanese);
Calibrate Naive Bayes to improve accuracy;
Support Vector Machine;
Apply algorithm to street names (proxy for local historical impact of immigration);
Improve Mincerian regression;
?

EXTRA SLIDES

Historical ancestry and surnames data

Race/color demographics in contemporary Brazil

Afro-Brazilian (8%);
"Mixed" Brazilian (43%);
"White" Brazilian (48%);
East Asian Brazilian (1% in 2010, but 0.45% in 2000);
Native Brazilian (0.5%).
