
Finding term equivalents in dynamically-built aligned corpora

Caroline Barrière
Interactive Language Technologies Group
September 25th 2008
Technical Report NRC 50726. ERB-1156.

Abstract

This report presents an approach based on partial sentence overlaps to find term equivalents in an aligned corpus dynamically built by the WeBitext software (Désilets et al. 2007). Our experiments are based on 200 terms taken from the Grand Dictionnaire Terminologique (GDT) [1]. Our results show that for terms that are relatively frequent within the Canadian Government sites (the sites used by the WeBitext software), our method can automatically find a known equivalent (listed in the GDT) in over 80% of cases. Our results also show that our approach suggests new plausible equivalent candidates, which a terminologist could validate when enriching a terminological database.

Résumé

Ce rapport vise la présentation d’une approche par recoupement de phrases pour la recherche de termes équivalents dans un corpus aligné tel que généré dynamiquement par le logiciel WeBitext (Désilets et al. 2007). Nos expérimentations sont construites à partir de 200 termes extraits du Grand Dictionnaire Terminologique (GDT) [2]. Nos résultats montrent que pour les termes ayant un minimum d’occurrences parmi les sites en ligne du gouvernement canadien (sites à partir desquels WeBitext construit son corpus), notre méthode retrouve automatiquement plus de 80% des termes équivalents déjà répertoriés dans le GDT. Nos résultats montrent aussi que notre approche permet de suggérer de nouveaux équivalents candidats plausibles qui pourraient être validés par un terminologue dans un but d’enrichissement d’une base terminologique.

Keywords: aligned bilingual corpus, term equivalent, terminological database
Mots-clés : corpus bilingue aligné, termes équivalents, base terminologique

[1] The « Grand Dictionnaire Terminologique » (GDT), published by the « Office Québécois de la Langue Française » (OQLF), is available at http://www.granddictionnaire.com.
[2] Le « Grand Dictionnaire Terminologique » (GDT) est publié par l’Office Québécois de la Langue Française (OQLF) et est disponible à l’adresse suivante: http://www.granddictionnaire.com.

1. Introduction

Maintaining and updating terminological databases is not an easy task, and few tools are available to help the terminologist with it. When a new domain is to be explored, an extensive thematic search can be performed to find important concepts and terms within that domain. Some tools have been specifically designed to help thematic searches, such as TerminoWeb (Barrière and Agbago 2006). Other tools, such as term extractors (Cabré et al. 2001, Drouin 2003), can help by finding terms for that new domain. But if the task is to update a term bank to account for the quickly evolving nature of language, then, to our knowledge, few tools are available. A tool such as Barçah (Quirion 2005) can be used to validate the usage of a set of known terms representing a particular concept in different types of communication (such as government, educational or journalistic communications). As a complement, we constantly need to enlarge the known subset of terms representing the concept.

To do so, we suggest an approach which takes advantage of the bilingual nature of a term bank and the availability of a bilingual corpus. More specifically, our method tries to answer the following question: “Knowing X is an English term for a particular concept, is it possible to automatically find French equivalents to X?” [3] This assumes that a terminological database already exists and that a record already contains the English term X for a particular concept. The main underlying assumption is that terms known in one language become the entry point into the data (corpus) to find terms in the other language.

This is a different approach from working within a monolingual point of view and studying term variants and synonyms. Work on term variants and their automatic discovery in corpora has been done by Jacquemin (2001) and Daille (2001), among others. The literature in computational linguistics contains work on automatically finding synonyms (Yu and Agichtein 2003) or, more often, near-synonyms (Edmonds and Hirst 2002). Synonyms are often found automatically in a monolingual corpus by statistical methods or by linguistic-based methods similar to those used in the discovery of other semantic relations such as hyperonymy or meronymy (Auger and Barrière 2008).

The bilingual approach suggested in this report requires the availability of an aligned corpus containing the terms of interest. WeBitext (Désilets et al. 2007), also being developed at the National Research Council of Canada, allows the automatic generation of such a corpus. WeBitext can dynamically build a bilingual aligned corpus, gathering information from the Canadian government sites (sites within the web domain .gc.ca). From this bilingual corpus, our goal is to extract equivalent candidates, so that the terminologist quickly finds different candidates without having to read through the bilingual corpus.

Even if we envisage our method as part of a knowledge discovery process to enrich a terminological database with new information, it is difficult to automatically evaluate our method on such a task. Instead, we propose an automatic evaluation to measure the success of our method at finding what is already stored and known in a terminological database. The plausible underlying assumption is that an approach that is good at recall will be good at discovery.

Here are the tasks involved in our experimentation:
1) Randomly choose 200 terms in the Grand Dictionnaire Terminologique (GDT),
2) Use WeBitext to build an aligned bilingual corpus around each term,
3) Use a partial sentence overlap method to find equivalent candidates,
4) Compare these candidates to the ones found in GDT.

This report presents each step, giving examples and showing results. Task 3 above is the core of our approach and our main technological contribution. The report also discusses the use of our method for term bank validation or enhancement. This idea links our suggested technology to the working environment of a terminologist.

[3] The source and target languages could be reversed, although the present version of WeBitext (which we use in our experiments) only allows searches using English as the source language. Newer versions of WeBitext will allow bi-directional searches.

2. Generating test data

The first step is to obtain a list of terms for the experiment. The terms were randomly selected from the Grand Dictionnaire Terminologique (GDT) [4]. Some examples of these terms are shown in Table 1. Randomness is guaranteed by letting a computer program make the choices rather than a human. The randomness is “total”, as we have not favored any particular domain, term frequency or term length.

As we can see in Table 1, some terms show a higher degree of polysemy than others. For example, “exhaust port” is present on 8 records, showing its presence in 8 different domains. Some terms are more specific than others. For example, the term “paraffin washer for yarn reeling” is much more specific than the term “disinfecting”. Using a Google hit count [5] as a quick measure of specificity, the former has 2 hits and the latter has 1.9 million. This shows the extent of the variation among terms, which will certainly affect the results.

[4] The Interactive Language Technology Group would like to thank the Office Québecois de la Langue Française for providing us with a copy of their terminological database for research purposes.
[5] Performing a search on www.google.com and obtaining the number of pages returned.
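The random selection described in this section can be sketched as follows. This is only an illustration under stated assumptions: the report does not describe the format of the GDT copy or the program used, so the record structure (a dictionary with an "english" key), the function name sample_terms and the optional seed are all hypothetical.

```python
import random


def sample_terms(term_records, k=200, seed=None):
    """Draw k distinct English terms uniformly at random.

    No filter on domain, frequency, or term length is applied.
    The record layout ({"english": ...}) is an assumption for this sketch.
    """
    # A polysemous term appears on several records; keep each term once.
    unique_terms = sorted({record["english"] for record in term_records})
    rng = random.Random(seed)
    return rng.sample(unique_terms, k)
```

Sampling over distinct terms rather than over records keeps a polysemous term such as “exhaust port” (8 records) from being more likely to be drawn than a term with a single record; whether the original program made that choice is not stated in the report.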

Table 1 – Examples of randomly selected terms

#    English term                        French equivalent
1    straightedge                        règle rectifiée
     straightedge                        règle de précision
2    vinyl cellulose                     vinyl cellulose
3    exhaust port                        orifice d'échappement
     exhaust port                        orifice d'échappement
     exhaust port                        orifice d'échappement
     exhaust port                        conduite d'évacuation des gaz chauds
     exhaust port                        orifice de sortie
     exhaust port                        prise de fumée
     exhaust port                        orifice de sortie
     exhaust port                        lumière d'échappement
4    hijacking                           détournement d'avion
     hijacking                           détournement illicite
     hijacking                           détournement d'avion
5    slipper spring                      ressort à lames montées sur glissières
6    strip mining                        exploitation à ciel ouvert
7    boiling period                      temps d'ébullition
8    disinfecting                        désinfection
9    phillips head screw                 vis cruciforme
10   hip lift                            porté
11   pocket spectroscope                 spectroscope de poche
12   paraffin washer for yarn reeling    rondelle de paraffine pour bobinage de fils textiles

Domains (in row order): automotive industry automobile maintenance tools office equipment and supplies office supplies paper industry automotive industry mechanical engineering automobile exhaust safety safety equipment electricity circuit breakers mining industry mine ventilation physics fluid mechanics ceramics ceramic tile drying and firing aeronautics railways railway traction equipment transport air transport transport air transport law automotive industry automobile suspensions energy brewing wort boiling and hopping cleaning mechanical engineering screws and bolts sports skating geology gemmology textiles spinning

3. Finding term equivalents

This section describes the core of our approach, which is the processing of a dynamically-built bilingual corpus to extract term equivalents. But first, we must acquire such a corpus, which we characterize as “dynamic” because it is built on the fly (in real time) around a particular term. To do so, a search for parallel sentences is launched using the WeBitext software, which explores the sites of the Canadian government (those ending with “.gc.ca”) to retrieve sentences containing the term of interest. In a way, WeBitext behaves as if all of .gc.ca were a large translation memory [6] (Désilets et al. 2007).

Given the variation in specificity and domain coverage of the selected terms, not all terms are found within the Canadian government sites. It is no surprise that a term such as “paraffin washer for yarn reeling”, occurring only 2 times on the whole WWW, is not found. The number of occurrences found for each term also varies. Table 2 gives the number of sentences retrieved by WeBitext [7]. More than half the terms are not found at all, leaving 89 terms with at least one occurrence.

Table 2 – Number of sentences retrieved by WeBitext

Sentences retrieved    0     1-5   5-10   11-20   20-50   50-100   > 100   Total
Number of terms        111   28    14     15      18      7        7       200

We now describe the steps involved in finding the term equivalent candidates, which are based on a simple hypothesis: “Given that all source sentences contain a particular term X in them, the aligned sentences on the target side must contain an equivalent to term X, and such equivalent could be extracted by finding what is common among these sentences”. The algorithm (let us call it the Overlap Algorithm) is therefore summarized as:

[step 1]
- Find all sentences on the target side in the bilingual corpus (N sentences)
[step 2]
- Take sentences 2 at a time (so N*(N-1) pairs)
  - Find their longest common substring (LCS)
  - Add their LCS to a list of equivalent candidates (List C)
[step 3]
- For all N sentences
  - For each equivalent candidate from List C
    - If the sentence contains the equivalent candidate
      - Increase the frequency count of that candidate
[step 4]
- Sort the candidates in List C by reverse order of frequency

[6] Canada being an officially bilingual country, all of its official publications are available in French and English, so corresponding web sites are always available.
[7] The numbers given are not necessarily the same as the total number of available sentences on .gc.ca containing the sample terms, since the site exploration algorithm behind WeBitext has an elapsed search time limit to make it usable in a real-time setting.
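The four steps of the Overlap Algorithm can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the report's implementation: it assumes that the longest common substring is computed over word tokens (the candidates reported are whole words or word sequences), that sentences are lowercased and stripped of surrounding punctuation before comparison, and that each unordered pair of sentences is compared once. The function names (tokenize, longest_common_run, contains_run, overlap_candidates) are illustrative.

```python
from collections import Counter
from itertools import combinations


def tokenize(sentence):
    """Lowercase, split on whitespace, trim surrounding punctuation (assumed normalization)."""
    tokens = (tok.strip(".,;:!?()\"'«»") for tok in sentence.lower().split())
    return [tok for tok in tokens if tok]


def longest_common_run(a, b):
    """Longest contiguous run of tokens shared by token lists a and b (dynamic programming)."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return tuple(a[best_end - best_len:best_end])


def contains_run(tokens, candidate):
    """True if the candidate token sequence occurs contiguously in tokens."""
    n = len(candidate)
    return any(tuple(tokens[i:i + n]) == candidate for i in range(len(tokens) - n + 1))


def overlap_candidates(target_sentences):
    """Return (candidate, frequency) pairs sorted by decreasing frequency."""
    tokenized = [tokenize(s) for s in target_sentences]       # step 1
    candidates = set()                                        # List C
    for a, b in combinations(tokenized, 2):                   # step 2
        run = longest_common_run(a, b)
        if run:
            candidates.add(run)
    counts = Counter()
    for tokens in tokenized:                                  # step 3
        for candidate in candidates:
            if contains_run(tokens, candidate):
                counts[candidate] += 1
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))  # step 4
    return [(" ".join(candidate), freq) for candidate, freq in ranked]
```

Applied to sentences 5, 6, 8 and 10 of Table 3a, this sketch ranks “contrôle de teneur” first (present in all four sentences), followed by “problèmes de contrôle de teneur” (present in two). The dynamic-programming table costs time proportional to the product of the two sentence lengths per pair, which is acceptable for the small number of sentences retrieved per term.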

Tables 3a and 3b show examples of the output of Step 1 for the terms « grade control » and « outboard ». The large .gc.ca translation memory contains English sentences in which « grade control » and « outboard » occur, and Tables 3a and 3b show the French sentences associated with these English sentences, as found in the corresponding French web sites. The results of this intermediate step contain some noise. For example, sentence 2 in Table 3a does not contain anything related to grade control, which could be due to a misalignment problem (when the aligned bilingual corpus is built by WeBitext). Sentence 9 in Table 3a is in English, not in the expected target language, French. This might be due to the French web site containing some English sentences. Problems at this step will obviously affect results at the following step, but we expect that enough data (enough sentences available) will attenuate such errors and still allow our Overlap Algorithm to work well.

Browsing through Tables 3a and 3b to manually extract possible French equivalents for the terms « grade control » and « outboard » takes some time. Searching through these tables already gives the reader an appreciation of the need for a tool that automatically extracts candidates from these sentences.

Table 3a – French sentences from bilingual corpus around term « grade control »

1   Pour augmenter la productivité, un front de taille avec une longueur de 4,8 m (16 pi) fut essayé mais a dû être arrêtée en raison de problème à suivre la veine; un front de taille avec une longueur maximale de 3,6 m (12 pi) est pris pour contrôler la teneur.
2   Les techniciens en géologie aident les géologues et les ingénieurs géologues.
3   Les déversoirs à tourbillon ne semblaient pas avoir beaucoup d'effet sur la pente de l'eau coulant sur le lit dont la rugosité a été accrue par l'installation de roches au fond du ponceau.
4   Aux sites d'exploitation minière, ils s'occupent généralement de la planification de la mine, de concert avec l'ingénieur minier; de plus, ils vérifient la teneur du minerai et font de l'exploration pour trouver de nouvelles réserves.
5   Afin de régler les problèmes de contrôle de teneur, le stérile est déblayé séparément; le séquençage des veines et la vérification conceptuelle sont menés à bien.
6   En raison de l'irrégularité du corps minéralisé, le contrôle de teneur s'effectue maintenant sur une période de 7 jours au lieu de 5.
7   Il n'y a pas de limites visuelles du contact pour le contrôle des teneurs.
8   Contrôle de teneur et système d'identification du minerai :
9   Grade control drilling of the upper
10  Problèmes de contrôle de teneur.
11  Suivi des teneurs

Table 3b – French sentences from bilingual corpus around term « outboard »

1   Celui-ci a acheté un moteur, un bateau et une remorque identiques et a d'abord dû payer les droits de douane sur le moteur et le bateau, mais, par suite d'un réexamen, le moteur a été classé dans le numéro tarifaire 8407.29.10 à titre de "Moteurs du type intérieur-extérieur".
2   Pour s'attaquer à court terme au problème de la pollution atmosphérique produite par les moteurs marins, Environnement Canada et l'Association canadienne des manufacturiers de produits nautiques (ACMPN) ont annoncé en janvier de cette année un Protocole d'entente visant à mettre volontairement sur le marché canadien des moteurs hors-bord et des motomarines moins polluants.
3   Motors in the Italian market divide into the following categories: 60% of the inboard motors run on diesel, 13% of two-stroke outboard motors run on gasoline, 10% run on outboard, four -stroke motors run on gasoline, 9% are inboard/outboard gasoline motors, inboard gasoline 0.09% and outboard diesel 0.029%.
4   Même si les moteurs hors-bord rejettent leurs émissions dans l'eau, de récentes études ayant trait à leurs effets sur les lacs ont montré que la plupart des composés hydrocarbonés présents dans l'eau migraient dans l'air en moins de six heures, et que des échantillons prélevés à une profondeur d'un mètre environ n'étaient pas contaminés.
5   Les bateaux à moteur avec moteurs du type intérieur-extérieur sont classés en vertu du numéro tarifaire 8903.92.00 et ce, en tant que bateaux à moteur complets du type intérieur-extérieur, conformément à la règle 1 des Règles générales pour l'interprétation (RGI) du Système harmonisé.
6   Les hors-bord qui fonctionnent au carburant diesel présentent aussi des possibilités en ce qui concerne la réduction des émissions et une plus grande économie de carburant, mais ils occupent actuellement très peu de place sur l'ensemble du marché total en raison de leur coût plus élevé et du manque de stations-services.
7   Les moteurs du type intérieur-extérieur, parfois désignés sous le nom de "stern drives", sont destinés à être fixés à l'intérieur de la coque à l'arrière du bateau, et ils sont combinés avec un bloc retenant une hélice de gouvernail qui est fixée à l'extérieur du bateau.

Table 4 shows the results of Steps 2, 3 and 4 of the Overlap Algorithm. The number to the left of each candidate is the frequency of occurrence of the suggested equivalent (sorted in reverse order of frequency, as stated in Step 4). The results in Table 4 allow us to discuss a few issues.

(1) Term variants

Performing a frequency count and sorting works best when a single obvious “winner” comes at the top of the list. This is not the case if multiple term variants exist for a concept. If the number of sentences is large enough, we hope to find all the variants in the list, each one covering a fraction of the occurrences. This assumes a rather uniform distribution of the variants among the examples, which might not be the case.

Table 4 – Examples of French equivalents suggested for different terms

Outboard
37 : moteur
36 : hors-bord
29 : outboard
25 : moteurs
19 : outboard motor
18 : du type
16 : bateau
15 : moteurs du type
12 : du type intérieur-extérieur
12 : bateaux
11 : outboard motorboat
11 : moteur hors-bord
11 : bateaux à moteur
10 : moteurs du type intérieur-extérieur

Junior engineer
4 : ingénieur
3 : mécanicien
3 : junior
2 : ingénieure junior
2 : subalterne

Inventory items
17 : articles
9 : en stock
8 : articles en stock
6 : stocks
5 : documents
5 : événements
5 : nous avons
4 : dans le
4 : échantillon
4 : liste maîtresse des
4 : articles de
4 : correspondants
4 : gestion
4 : suivi des
4 : aux événements majeurs

Grade control
8 : teneur
6 : contrôle
4 : contrôle de teneur
2 : en raison de
2 : teneurs
2 : problèmes de contrôle de teneur
2 : du minerai

Different variants are likely to be used in different types of communication (more or less formal, or written by different authors), as would be found on different sites. It would be advantageous for the sentences gathered by WeBitext and put in the aligned corpora to come from different web sites, but the present WeBitext version performs no such validation. On a different point, it is known in terminology that one of the most common forms of terminological variant in technical writing is the head variant, in which only the head is used (“articles en stock” referred to by “articles”, or “contrôle de teneur” referred to by “contrôle”) so as not to make the text too heavy. It is therefore no surprise that these head nouns come at the top of our list of candidates (their frequency adds up from occurrences as individual words and as parts of compounds). As this form of variation would be found within a single web page, it could also be important from this point of view to obtain sentences from different web sites.

(2) Polysemy

Multiple term candidates might be suggested not only because of multiple variants for the same concept, but also because of multiple terms for different concepts which happen to be referred to by the same polysemous English term. Table 1 showed examples of terms taken from the GDT with varied degrees of polysemy. The more general a term is, the more polysemous it usually is. To address this issue, we would require more specific web searches by domain, which are not yet available in WeBitext. In fact, if we combine this issue of polysemy with the issue of term variants presented in point (1) above, we have two requirements going in opposite directions, one of restriction (from .gc.ca to a subdomain) and one of expansion (multiple sites within that subdomain). Dealing with polysemy requires a focus on a certain subset of sites within .gc.ca, a subset which would represent a domain of interest; within that subset, we then hope to find sentences coming from different sites in order to find different variants for the same concept.

(3) Related terms

It is likely that noise in the candidate lists comes from interference with related terms. For example, the term “outboard” might be present in sentences about “motorboats with outboard motors” as distinct from “inboard engines” or “inboard-outboard engines”. Because these terms are related, they often appear in the same sentences. The equivalent candidate “moteurs du type intérieur-extérieur” (in Table 4) is actually an equivalent for the related term “inboard-outboard engine” rather than an equivalent for “outboard”. Although related terms do add noise to the list, it might be “good noise”, as it could prompt the terminologist to investigate these terms as well, to understand their similarities and differences, and so help define them in the term bank.

3. Results

To validate the equivalent candidates found by our algorithm, we use the content of the GDT records as a “gold standard”. Table 5 shows the French equivalents listed in the GDT for the same four terms given in Table 4.

Table 5 – Some examples of GDT French equivalents for terms

Outboard:          moteur hors-bord; hors-bord; spoiler extérieur; extérieur
Junior engineer:   ingénieur stagiaire
Inventory items:   article d’inventaire
Grade control:     contrôle de teneur; contrôle de la qualité de la fibre

The results presented here are based on the content of a terminological database which we hope to extend with our approach. Although this is somewhat paradoxical, it is the only automatic evaluation method available. With no access to a terminologist to manually look at the results of our algorithm, the term bank (even if imperfect) allows an automatic validation of our results. This validation provides a recall measure which only characterizes the extent to which we are able to replicate what is already known, and gives no information on how well we can find new knowledge. We can hypothesize, though, that if recall is low (or high), our discovery potential is also low (or high), so these measures are somewhat correlated.

To present recall results, we look at the 89 terms for which WeBitext returned sentences containing them, and compare the Top 5, Top 10 and Top 20 candidates given by our algorithm to the equivalents within the GDT (regardless of the domains, which is obviously a simplification). Table 6 shows the results.

Table 6 – Recall of the Overlap Algorithm

Candidates   Recall
Top 5        38.2%
Top 10       40.4%
Top 20       42.7%
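The Top 5, Top 10 and Top 20 comparison can be sketched as below. This is an illustrative reading of the evaluation rather than the report's code: a term counts as a success when any of its first k candidates exactly matches one of its GDT equivalents, with domains ignored. The function names hit_at_k and recall_at_k are hypothetical; the two sample entries are taken from Tables 4 and 5.

```python
def hit_at_k(ranked_candidates, gold_equivalents, k):
    """True if any of the first k candidates exactly matches a gold equivalent."""
    gold = set(gold_equivalents)
    return any(candidate in gold for candidate in ranked_candidates[:k])


def recall_at_k(results, k):
    """results maps a term to (ranked candidates, gold equivalents)."""
    hits = sum(hit_at_k(cands, gold, k) for cands, gold in results.values())
    return hits / len(results)


results = {
    "grade control": (
        ["teneur", "contrôle", "contrôle de teneur", "en raison de", "teneurs"],
        ["contrôle de teneur", "contrôle de la qualité de la fibre"],
    ),
    "junior engineer": (
        ["ingénieur", "mécanicien", "junior", "ingénieure junior", "subalterne"],
        ["ingénieur stagiaire"],
    ),
}
print(recall_at_k(results, 5))  # grade control is a hit, junior engineer is not
```

Because matching is exact, “ingénieur” does not count as a match for “ingénieur stagiaire”, so partial matches and inflected forms are all scored as misses.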

The Overlap Algorithm is strongly influenced by the availability of sentences containing the searched term. To emphasize this influence, we split the results by the number of sentences returned by WeBitext and show them in Table 7. It can also happen that a bilingual corpus is found around a term but contains none of the equivalents listed in the GDT, in which case finding one is impossible. That is a weakness of our combined WeBitext-Overlap method, and we should be aware that recall is influenced by both the corpus and the algorithm searching in the corpus. Table 7 shows detailed results taking these variations into consideration. As the difference between Top 10 and Top 20 was not large in Table 6, the results presented in Table 7 assume a Top 10 cut-off point.

Table 7 shows that the terms for which WeBitext returns fewer than 5 sentences are the most problematic. Of the 28 terms in that state, only 12 have a corpus containing one of the GDT equivalents (weakness of the corpus), and the algorithm finds the equivalent for only one of them (weakness of the algorithm).

Table 8 summarizes the recall of our method under different constraints. In “good conditions” (meaning the term we look for is present at least 5 times in .gc.ca and the retrieved sentences contain a GDT equivalent), the method has a recall of 84%. When the minimum of 5 sentences is dropped (at least one sentence, with a GDT equivalent present), recall drops to 67%.

Table 7 – Influence of the number of sentences retrieved by WeBitext on recall results

Number of sentences                                                     <5   5-10   11-20   20-50   50-100   > 100   Total
Nb terms with this number of sentences                                  28   14     15      18      7        7       89
Nb terms for which at least a GDT equivalent is present                 12   9      8       14      6        6       54
Nb terms for which the Overlap Algorithm finds one of the equivalents   1    7      8       10      4        6       36

Table 8 – Overlap algorithm recall in different conditions

Condition                                      Nb terms   Nb terms for which the Overlap Algorithm      Recall
                                                          has found an equivalent present in the GDT
At least one sentence                          89         36                                            40.4%
At least one sentence with a GDT equivalent    54         36                                            66.7%
At least 5 sentences                           61         36                                            59.0%
At least 5 sentences with a GDT equivalent     43         36                                            83.7%
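The recall figures in Table 8 follow directly from the two count columns, as the short check below shows. It uses only the counts printed in Table 8; the function name recall_percent is illustrative. Note that 36 out of 89 rounds to 40.4%, which agrees with the Top 10 figure in Table 6.

```python
def recall_percent(found, total):
    """Recall as a percentage, rounded to one decimal place."""
    return round(100 * found / total, 1)


# (condition, number of terms, number of terms with a GDT equivalent found)
conditions = [
    ("At least one sentence", 89, 36),
    ("At least one sentence with a GDT equivalent", 54, 36),
    ("At least 5 sentences", 61, 36),
    ("At least 5 sentences with a GDT equivalent", 43, 36),
]

for label, nb_terms, nb_found in conditions:
    print(f"{label}: {recall_percent(nb_found, nb_terms)}%")
```

The same arithmetic gives the corpus coverage: 89 of the 200 sampled terms returned at least one sentence, which is 44.5%.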

4. Term bank expansion

As mentioned in the introduction, the purpose of our technology development is to provide a tool that helps terminologists enhance, expand or validate their term bank. From this point of view, the examples shown in Table 9 are interesting.

Table 9 – Suggestions by the algorithm of terms not found in GDT

Term               Suggestion             GDT
inventory item     articles en stock      article d’inventaire
stretching         étirements             étirement musculaire
whiskers           moustaches             monocristaux orientés (géologie); barbes (métallurgie); trichite (plastics)
pupil              élèves                 écolier
air gun            arme à air comprimé    souflette à air; pistolet pneumatique; pistolet à air; carabine à air comprimé
operation plan     du plan opérationel    plan d’opération
core temperature   température interne    température à coeur
senior scientist   chercheur principal    maître de recherches
roadmap            carte routière         calendrier de lancement
flipper            nageoires              palme (sports); battant / flipper (leisure activities)
junior engineer    ingénieure junior      ingénieur stagiaire

Unfortunately, without the help of a terminologist, such results are difficult for the developer to evaluate. Still, Table 9 lists terms suggested by our method that seem plausible but are not found in the GDT. As a first example, the term “ingénieur stagiaire” listed in the GDT is not one of the candidates. Instead, the algorithm suggests “ingénieur junior”, which could be validated and added to the GDT. As a second example, the term “article en stock” is not in the GDT (“article d’inventaire” is listed); a terminologist could decide whether its usage is wrong, in which case it could be noted as “à éviter” (to avoid) in the term bank, or correct, in which case it could be added as another term alongside “article d’inventaire” for the same concept.

Candidates in Table 9 can be viewed as a starting point for exploring new variants to be included in the term bank. Some variants might not be considered good variants, but since they are used in communications (e.g. “articles en stock” found in Canadian government communications), they could still be included in the term bank with a particular restriction note.

5. Conclusion

We presented a method to search for term equivalents in a dynamically built bilingual corpus. Our experiments, using terms from the GDT, show that in “good conditions”, meaning terms occurring in .gc.ca a minimum of 5 times with a GDT equivalent present among the retrieved sentences, a recall of 84% is obtained using the GDT as the gold standard. As the usefulness of our approach would be in the expansion and validation of a term bank, we also presented in section 4 examples of equivalent candidates which are not present in the GDT and could be added.

Our approach has limitations, and exploring the reasons for these limitations leads us to future work. A first limitation is the corpus used, that is, publications from the Canadian government as found on .gc.ca sites. For specialized terms, such a corpus might not be the most appropriate, although given the various degrees of specialization of the randomly chosen terms in our experiment, the 45% coverage observed is not negligible. Future research can focus on automatically finding specialized bilingual corpora online. Certainly, some domains are more likely to be represented in bilingual sites than others. For example, bilingual sites in the health domain might be more present on the WWW than bilingual sites for the automotive industry domain.

The second limitation is not in the corpus but in the overlap method itself. If the number of sentences is too small (fewer than 5), it is difficult to find candidates. Future research should explore variations on the method for such cases.

Future work should also refine our evaluation approach. As of now, it does not take into account simple variations such as gender or number and only accepts perfect matches between candidates and known equivalents, which slightly underestimates our recall.

Overall, this report presented results which establish the potential of the Overlap Algorithm in combination with the WeBitext software for the search of term equivalents. The method’s recall results are promising enough to envisage performing larger-scale tests and working with terminologists to better evaluate the method’s value in their working environment for tasks of term bank validation and enhancement.

Acknowledgments

The author would like to thank the IIT-NRC WeBitext team (A. Désilets, B. Farley, M. Stojanovic) for having made possible an internal access to their software. The author would also like to thank the Office québécois de la langue française for allowing our research group (Interactive Language Technology Group) to use the Grand Dictionnaire Terminologique for computational terminology research purposes.

Bibliography

Auger, A. and Barrière, C. (2008). Pattern-based approaches to semantic relation extraction: A state-of-the-art. In Auger, A. and Barrière, C. (eds), Special Issue on Pattern-based Approaches to Semantic Relation Extraction, Terminology 14(1).

Barrière, C. and Agbago, A. (2006). “TerminoWeb: a software environment for term study in rich contexts”. International Conference on Terminology, Standardization and Technology Transfer, Beijing, pp. 103-113.

Cabré Castellvi, M.T., Estopa, R. and Palatresi, J.V. (2001). Automatic term detection: A review of current systems. In Bourigault, D., Jacquemin, C. and L’Homme, M.C. (eds), Recent Advances in Computational Terminology, Vol. 2, pp. 53-87.

Daille, B. (2001). “Qualitative terminology extraction: Identifying relational adjectives.” In Bourigault, D., Jacquemin, C. and L’Homme, M.C. (eds), Recent Advances in Computational Terminology, Amsterdam / Philadelphia: John Benjamins, pp. 149-166.

Désilets, A., Farley, B. and Stojanovic, M. (2008). WeBiText: Building Large Heterogeneous Translation Memories from Parallel Web Content. Translating and the Computer 30.

Drouin, P. (2003). Term extraction using non-technical corpora as a point of leverage. Terminology, 9(1), pp. 99-117.

Edmonds, P. and Hirst, G. (2002). Near-synonymy and lexical choice. Computational Linguistics, vol. 28, pp. 105-144.

Jacquemin, C. (2001). Spotting and Discovering Terms through Natural Language Processing. Cambridge: MIT Press.

Quirion, J. (2005). L'automatisation de la terminométrie ; premiers résultats. 5e Congrès International : Langue et terminologie helléniques, Nicosie (Chypre), Athènes. Techniko Epimelitirio Elladas, pp. 61-70.

Yu, H. and Agichtein, E. (2003). “Extracting synonymous gene and protein terms from biological literature”. Bioinformatics 19(1), Oxford University Press.
