Transcription

Natural Language Processing for Indian Languages: A Language Relatedness Perspective
Anoop Kunchukuttan
Microsoft India Translation & Speech Group
Workshop on Indian Language Data: Resources and Evaluation, 24th May 2020

Why Language Relatedness?

Usage and Diversity of Indian Languages
- 4 major language families
- 22 scheduled languages
- 125 million English speakers
- 8 languages in the world's top 20 languages
- 30 languages with more than 1 million speakers
Sources: Wikipedia, Census of India 2011
[Chart: Internet User Base in India (in million). Source: "Indian Languages: Defining India's Internet", KPMG-Google Report 2017]

Applications requiring Indian language support
- Entity Identification
- Entity Linking
- Information Extraction & Categorization
- Question & Answering
- Recommendation

Scalability Challenges in ML solutions
- NLP requires human expertise that is difficult and expensive to replicate for every language
  - Annotated data
  - Linguistic knowledge inputs
- Difficult to deploy and maintain systems for multiple languages
- Expensive to create datasets for each language

Broad Goal: Build NLP applications that can work on different languages
[Figure: an English-Hindi machine translation system and a Tamil-Punjabi machine translation system]
- Can we improve English-Hindi translation using a Tamil-Punjabi model?
- Can we do English-Punjabi translation even if this data is not seen in training?
- Can we train a single model for all translation pairs?

Need for a Unified Approach for Indic NLP
- Can we share resources across languages?
- Can that also reduce effort & cost for deployment and maintenance?
- Can diversity of languages lead to better generalization?
- Can we utilize relatedness between Indian languages?

What is Language Relatedness?

Why are Indian languages related?
Related Languages
- Related by Genealogy: Language Families (Dravidian, Indo-European, Turkic) (Jones, Rasmus, Verner, 18th & 19th centuries; Raymond ed. (2005))
- Related by Contact: Linguistic Areas (Indian Subcontinent, Standard Average European) (Trubetzkoy, 1923)
Related languages may not belong to the same language family!

Cognates & Borrowed Words in Indian Languages
- Indo-Aryan words in Dravidian languages
- Other borrowings like echo words, retroflex sounds in the other direction (Subbarao, 2012)
[Table: cognates and borrowed words across English, Indo-Aryan, and Dravidian languages; column alignment lost. Surviving entries:
  Vedic dRotika; chapātī, roṭī; roṭi; paũ, roṭlā; chapāti, poli, bhākarī
  fruit: pazham, kanni; pazha.n, phala.n; haNNu, phala; pa.nDu
  water: jala.m
  Surviving language labels: Telugu, Malayalam]
Source: Wikipedia and IndoWordNet

Key Similarities between related languages
Marathi: भारताच्या स्वातंत्र्यदिनानिमित्त अमेरिकेतील लॉस एन्जल्स शहरात कार्यक्रम आयोजित करण्यात आला
  bhAratAcyA svAta.ntryadinAnimitta ameriketIla lOsa enjalsa shaharAta kAryakrama Ayojita karaNyAta AlA
Marathi (segmented): भारता च्या स्वातंत्र्य दिना निमित्त अमेरिके तील लॉस एन्जल्स शहरा त कार्यक्रम आयोजित करण्यात आला
  bhAratA cyA svAta.ntrya dinA nimitta amerike tIla lOsa enjalsa shaharA ta kAryakrama Ayojita karaNyAta AlA
Hindi: भारत के स्वतंत्रता दिवस के अवसर पर अमरीका के लॉस एन्जल्स शहर में कार्यक्रम आयोजित किया गया
  bhArata ke svata.ntratA divasa ke avasara para amarIkA ke losa enjalsa shahara me.n kAryakrama Ayojita kiyA gayA
- Lexical: share significant vocabulary (cognates & loanwords)
- Morphological: correspondence between suffixes/post-positions
- Syntactic: share the same basic word order

Orthographic Similarity

Brahmi-derived Indic scripts are orthographically similar
- Largely overlapping character set, but the visual rendering differs
- Highly overlapping phoneme sets
- Highly consistent grapheme-to-phoneme mapping

A simple and powerful property to utilize relatedness between Indian languages

Script Conversion: Read any script in any script
- The Unicode standard enables consistent script conversion with a single rule:
  unicode_codepoint(char) - Unicode_range_start(L1) + Unicode_range_start(L2)
- As a developer, you can read text in a script you understand
- Only a single mapping is needed for Romanization too
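The offset rule on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in the talk or in any library: the `SCRIPT_RANGE_START` table and the `convert_script` function are hypothetical names, the language codes are chosen for illustration, and the sketch assumes each script occupies a 128-codepoint Unicode block with parallel layout. It does not handle script-specific exceptions or codepoints that are unassigned in the target block.

```python
# Start of the Unicode block for each script (values from the Unicode code charts).
# The language-code keys are an illustrative choice, not a fixed convention.
SCRIPT_RANGE_START = {
    "hi": 0x0900,  # Devanagari
    "bn": 0x0980,  # Bengali
    "pa": 0x0A00,  # Gurmukhi
    "gu": 0x0A80,  # Gujarati
    "or": 0x0B00,  # Oriya
    "ta": 0x0B80,  # Tamil
    "te": 0x0C00,  # Telugu
    "kn": 0x0C80,  # Kannada
    "ml": 0x0D00,  # Malayalam
}
BLOCK_SIZE = 0x80  # each of these blocks spans 128 codepoints


def convert_script(text, src_lang, tgt_lang):
    """Map each character by its offset within the source block.

    Characters outside the source block (spaces, digits, Latin letters)
    are copied unchanged.
    """
    src_start = SCRIPT_RANGE_START[src_lang]
    tgt_start = SCRIPT_RANGE_START[tgt_lang]
    converted = []
    for char in text:
        offset = ord(char) - src_start
        if 0 <= offset < BLOCK_SIZE:
            converted.append(chr(tgt_start + offset))
        else:
            converted.append(char)
    return "".join(converted)


# "kerala" written in Devanagari, converted to Gujarati script.
devanagari_kerala = "\u0915\u0947\u0930\u0932"
print(convert_script(devanagari_kerala, "hi", "gu"))
```

Because the mapping is a pure offset, converting back with the source and target swapped recovers the original string for characters that exist in both blocks.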

Multilingual Transliteration (Kunchukuttan et al., 2018)
- Train a joint transliteration model for multiple Indian languages to English & vice-versa
- Pool training data and convert to a common script
- Example of multi-task learning: similar tasks help each other
- Zero-shot transliteration is possible: perform Telugu-English transliteration even if the network has not seen that data
- Pre-requisite to neural transfer learning: represent all data in a common script
[Figure: transliteration examples such as केरल / kerala, ಬೆಂಗಳೂರು (Kannada) / bengaluru, and kozhikode]

Indian Language Speech Sound Label Set (Samudravijaya & Murthy, 2012)
- A common set of phones and their mappings to Indic scripts can be defined
- Useful for multilingual ASR, TTS, G2P (Schultz et al., 2001; Abraham et al., 2014; Abraham et al., 2016)

Phonetic Representation
- Represent each Indic character as a feature vector
- Define a similarity measure based on the feature vector
- Could be used for transliteration, cognate identification, spelling correction, etc. (Kondrak, 2001; Kunchukuttan et al., 2016)
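To make the idea concrete, here is a toy sketch of a feature-vector similarity between characters. The four binary features and the five Devanagari consonants are hand-picked for illustration only; they are an assumption of this sketch and not the feature set of the cited work. Cosine similarity is likewise one plausible choice of measure among several.

```python
import math

# Hypothetical binary features: (velar, labial, voiced, aspirated).
# Keys are Devanagari consonants, written as escapes with their names.
PHONETIC_FEATURES = {
    "\u0915": (1, 0, 0, 0),  # ka:  velar, unvoiced, unaspirated
    "\u0916": (1, 0, 0, 1),  # kha: velar, unvoiced, aspirated
    "\u0917": (1, 0, 1, 0),  # ga:  velar, voiced, unaspirated
    "\u092a": (0, 1, 0, 0),  # pa:  labial, unvoiced, unaspirated
    "\u092c": (0, 1, 1, 0),  # ba:  labial, voiced, unaspirated
}


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def phonetic_similarity(char1, char2):
    """Similarity of two characters based on their phonetic feature vectors."""
    return cosine_similarity(PHONETIC_FEATURES[char1], PHONETIC_FEATURES[char2])


# ka is closer to kha (same place of articulation) than to pa.
print(phonetic_similarity("\u0915", "\u0916"))
print(phonetic_similarity("\u0915", "\u092a"))
```

With script conversion to a common block, the same feature table could in principle be shared across Brahmi-derived scripts, which is what makes the representation attractive for cognate identification.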

Orthographic Syllable
- akshara, the fundamental organizing principle of Indian scripts
- Pseudo-syllable: (CONSONANT)+ VOWEL
- True syllable: Onset, Nucleus and Coda
- Orthographic syllable: Onset, Nucleus
- Examples: की (kI), प्रे (pre)
- Syllable as the basic transliteration unit (Atreya et al., 2015)

Hindi        Kannada      English
वि द्या ल य    ವಿ ದ್ಯಾ ಲ ಯ    vi dya lay
अ र्जु न       ಅ ರ್ಜು ನ       a rju n
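A rough way to see what an orthographic syllable looks like in practice is a regular expression over Devanagari codepoints. The sketch below is an assumption-laden simplification, not the segmenter used in the cited work: it treats an orthographic syllable as zero or more consonant+virama pairs, a consonant, an optional vowel sign, and an optional modifier (or an independent vowel with an optional modifier). It ignores nukta, word-final virama, and other script-specific cases, and the function name is hypothetical.

```python
import re

# Devanagari character classes (codepoint ranges from the Unicode code chart).
CONSONANTS = "\u0915-\u0939"          # ka .. ha
INDEPENDENT_VOWELS = "\u0904-\u0914"  # a .. au
VOWEL_SIGNS = "\u093e-\u094c"         # dependent vowel signs (matras)
VIRAMA = "\u094d"                     # halant, joins consonants into a cluster
MODIFIERS = "\u0901-\u0903"           # candrabindu, anusvara, visarga

ORTHOGRAPHIC_SYLLABLE = re.compile(
    f"(?:[{CONSONANTS}]{VIRAMA})*[{CONSONANTS}][{VOWEL_SIGNS}]?[{MODIFIERS}]?"
    f"|[{INDEPENDENT_VOWELS}][{MODIFIERS}]?"
    "|.",  # fallback: any other character becomes its own unit
    re.DOTALL,
)


def orthographic_syllables(word):
    """Split a Devanagari word into approximate orthographic syllables."""
    return ORTHOGRAPHIC_SYLLABLE.findall(word)


# The Hindi word for "school" (vidyalaya), written with escapes for safety.
vidyalaya = "\u0935\u093f\u0926\u094d\u092f\u093e\u0932\u092f"
print(" ".join(orthographic_syllables(vidyalaya)))
```

On this word the pattern yields four units (vi, dya, la, ya), with the consonant cluster kept together in the second unit.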

IndicNLP Library
Utilize similarity between Indian languages for scaling NLP applications to multiple Indian languages
- Text Normalizer
- Syllabification
- Query Script Information
- Phonetic …
- Script Converter
- Romanization
- Indicization
…ic nlp library

Lexical Similarity

Lexical Similarity (words having similar form and meaning)
- Cognates: a common etymological origin
  roTI (hi), roTlA (pa): bread
  bhai (hi), bhAU (mr): brother
- Named Entities: do not change across languages
  mu.mbaI (hi), mu.mbaI (pa), mu.mbaI (pa)
  keral (hi), k.eraLA (ml), keraL (mr)
- Loan Words: borrowed without translation
  matsya (sa), matsyalu (te): fish
  pazha.m (ta), phala (hi): fruit
- Fixed Expressions/Idioms: MWE with non-compositional semantics
  dAla me.n kuCha kAlA honA (hi), dALa mA kAIka kALu hovu (gu): something fishy
Enables sharing of data across languages

How similar are Indian Languages?
Estimate lexical similarity from a parallel corpus.

Longest Common Subsequence Ratio (LCSR) for a sentence pair:

$$\mathrm{LCSR}(s_1, s_2) = \frac{\mathrm{LCS}(s_1, s_2)}{\max(\mathrm{len}(s_1), \mathrm{len}(s_2))}$$

LCSR for a language pair:

$$\mathrm{LCSR}(L_1, L_2) = \frac{1}{|P(L_1, L_2)|} \sum_{(s_1, s_2) \in P(L_1, L_2)} \mathrm{LCSR}(s_1, s_2)$$

Computed on the ILCI corpus.
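The two LCSR definitions translate directly into code. The sketch below assumes that LCS denotes the length of the longest common subsequence computed over characters, that P(L1, L2) is the set of parallel sentence pairs for the two languages, and that both sides have already been converted to a common script or romanization. The function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcsr(s1, s2):
    """LCSR for a sentence pair: LCS length divided by the longer length."""
    longest = max(len(s1), len(s2))
    return lcs_length(s1, s2) / longest if longest else 0.0


def language_pair_lcsr(sentence_pairs):
    """LCSR for a language pair: mean sentence-level LCSR over a parallel corpus."""
    pairs = list(sentence_pairs)
    if not pairs:
        raise ValueError("parallel corpus is empty")
    return sum(lcsr(s1, s2) for s1, s2 in pairs) / len(pairs)


# Romanized cognates for "bread": the common subsequence is "roT".
print(lcsr("roTI", "roTlA"))  # 3 / 5 = 0.6
```

Python strings are sequences of code points, so on native-script text a consonant and its vowel sign count as separate symbols. Whether that granularity is appropriate depends on the comparison being made.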

Indian-Indian Language subword-level MT (Kunchukuttan & Bhattacharyya, 2016; Siripragada et al., 2020)
- Fine vocab segmentations like BPE and SentencePiece are popular for NMT
- Can learn translation models with less data
- Balance between utilizing lexical similarity and word-level information
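To illustrate why subword segmentation helps related languages, here is a toy byte-pair-encoding learner. It is a generic sketch of the BPE merge-learning loop and is not the tooling or configuration used in the cited papers. The function name, the `</w>` end-of-word marker, and the three romanized example words are assumptions made for illustration.

```python
from collections import Counter


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merge operations from a list of words."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(word) + ("</w>",) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pair_counts = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Rewrite every word, fusing each occurrence of the best pair.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges


# Romanized variants of "bread" from related languages share the prefix "roT".
print(learn_bpe(["roTI", "roTlA", "roTi"], 2))
```

On these three words the first two merges build the shared unit "roT", so a model trained on the pooled vocabulary sees the same subword regardless of which language the word came from.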

Pivot SMT between Indian Languages (Kunchukuttan & Bhattacharyya, 2017)
- Related languages: use subword-level translation units
- Translation through an intermediate language: use pivot-based SMT methods
- Combine the two approaches

Transfer learning for En-IL NMT (Zoph et al., 2016; Nguyen et al., 2017; Lee et al., 2017; Dabre et al., 2018)
- We want Gujarati-English translation, but little parallel corpus is available
- We have a lot of Marathi-English parallel corpus
- Train models at the r…

Make Indian Language Representations …
[Figure: Gujarati input, Shared Encoder, Shared Attention Mechanism, English Decoder]
- Script Conversion/Transliteration (Dabre et al., 2018)
- Word-by-word translation
- Word-by-word translation with rescoring using an LM

Transfer Learning works best for related languages
Encoder representations cluster by language family (Kudugunta et al., 2019)

English to Indian Languages
How do we support multiple target languages with a single decoder?
A simple trick: append a special token to the input indicating the target language.
Original Input: France and Croatia will play the final on Sunday
Modified Input: France and Croatia will play the final on Sunday hin
Still an open problem
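The target-language token trick amounts to a one-line preprocessing step. In the sketch below, the function name and the angle-bracket token format are assumptions for illustration. The slide only shows the language code `hin` appended to the sentence, and any token that cannot be confused with ordinary vocabulary would serve.

```python
def add_target_language_token(sentence, target_lang):
    """Append a special token telling a multilingual model which language to produce.

    The "<lang>" format is an illustrative choice, not a fixed convention.
    """
    return f"{sentence.strip()} <{target_lang}>"


source = "France and Croatia will play the final on Sunday"
print(add_target_language_token(source, "hin"))
# France and Croatia will play the final on Sunday <hin>
```

The same source sentence can then be paired with different target tokens during training, so a single decoder learns to produce several languages.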

Backtranslation via Multilingual Model
[Figure: a backward multilingual MT system produces synthetic data (L to E, L' to E'), which is used to train the forward MT system]

Experiment                          BLEU
Baseline Bilingual                  19.7
(2) Baseline Multilingual E X       22.3
(2) bilingual backtranslation       26.1
(2) multilingual backtranslation    27.0

English-Spanish with English-French as helper pair

Multilingual Pre-trained NLU models
- Transformer encoder with masked LM objective, i.e. try to predict masked words
- Concat data from all languages
- Multilingual BERT (Devlin et al., 2018): Wikipedia
- XLM-R (Conneau et al., 2019): CommonCrawl
- iNLTK: Wikipedia
Open questions:
- How can we explicitly model language relatedness?
- How can language relatedness assumptions speed up pretraining?

Large-scale corpora and Evaluation sets
Some recent efforts
- OSCAR Corpus: CommonCrawl (Suarez et al., 2020) https://oscar-corpus.com/
- AI4Bharat Corpus: News websites (Kunchukuttan et al., 2020) https://github.com/ai4bharat-indicnlp/indicnlp_corpus
We also need Indian English content to overcome domain mismatches
Evaluation
- Few datasets for NLU tasks
- Mostly represent high resource languages like Hindi and Telugu
- Need datasets spanning all major languages
- WikiAnn NER (Pan et al.)
- iNLTK News Headlines
- AI4Bharat News Articles (Kunchukuttan et al., 2020)
- Catalog

Syntactic Similarity

Syntactic Similarity between Indian languages
- Almost all Indian languages have SOV word order
- SOV word order determines relative order between:
  - Noun-adposition
  - Noun-genitive
  - Noun-relative clause
  - Verb-auxiliary
- Word order plays a very important role in most NLP applications
  - Language Modelling
  - Machine Translation
- Relatively free word order

Angla-Bharati (Sinha et al., 1995)
[Figure: English Parsing & Analyser produces a pseudo-target for Indic languages, which feeds the Hindi Generator, Marathi Generator, and Tamil Generator]
- English Analyzer is shared across Indian languages
- Common pseudo-target for all Indic languages is generated
- Can generate specialized pseudo-target for language groups, e.g. Indo-Aryan, Dravidian

Source reordering for SMT (Kunchukuttan et al., 2014)
Change the order of words in the input sentence to match the word order in the target language.
Original: Bahubali earned more than 1500 crore rupees at the boxoffice
Reordered: Bahubali the boxoffice at 1500 crore rupees earned
Hindi: बाहुबली ने बॉक्सऑफिस पर 1500 करोड़ रुपए कमाए
- A common set of rules can be written for all Indian languages
- Rules from (Ramanathan et al., 2008; Patel et al., 2013) for Hindi
https://github.com/anoopkunchukuttan/cfilt_preorder

Bridging Word-order Divergence for low-resource NMT (Rudramurthy et al., 2019)
[Figure: Gujarati input, Attention Mechanism, Hindi Decoder; little G-H corpus]
(1) E-H to G'-H corpus by word …
(2) Train with G'-H
(3) Fine-tune with G' H
Problem: cannot ensure that similar Gujarati and English words have similar representations
Solution: pre-order the English sentence to match Gujarati word order

Exploiting syntactic similarity in IL-IL translation
Can reduce search choices and errors, and improve decoding speed.
- RMT: No need to handle long-distance reordering
  - Anusaaraka (Bharati et al., 2003)
  - Sampark (Anthes, 2010)
- SMT: Monotonic decoding, subword models
- NMT: Local attention between encoder and decoder (Luong et al., 2015)

Language relatedness can be successfully utilized between languages where a contact relation exists

- Source reordering for SMT using Hindi-driven rules
- Addressing syntactic divergence in NMT using Hindi-driven rules

Experiment                  BLEU
Baseline                    12.91
Hindi as helper language    16.25

Tamil to English NMT with transfer learning using Hindi

Summary
- Utilizing language relatedness is important to scale NLP technologies to a large number of Indian languages.
- The orthographic similarity of Indian languages is a strong starting point for utilizing language relatedness.
- Contact as well as genetic relatedness are useful in the context of Indian languages.
- Multilingual pre-trained models trained on large corpora are needed for transfer learning in NLU and NLG tasks.
- Efficient training and inference are needed to experiment with more models that utilize language relatedness.

Thank you!
anoopk.in

References

Bharati, A., Chaitanya, V., Kulkarni, A. P., Sangal, R., & Rao, G. U. (2003). ANUSAARAKA: overcoming the language barrier in India. arXiv preprint cs/0308018.
Anthes, G. (2010). Automated translation of Indian languages. Communications of the ACM, 53(1), 24-26.
Atreya, A., Chaudhari, S., Bhattacharyya, P., and Ramakrishnan, G. (2016). Value the vowels: Optimal transliteration unit selection for machine. Unpublished, private communication with authors.
Basil Abraham, S Umesh and Neethu Mariam Joy. "Overcoming Data Sparsity in Acoustic Modeling of Low-Resource Language by Borrowing Data and Model Parameters from High-Resource Languages", Interspeech, 2016.
Basil Abraham, Neethu Mariam Joy, Navneeth K and S Umesh. "A data-driven phoneme mapping technique using interpolation vectors of phone-cluster adaptive training." Spoken Language Technology Workshop (SLT), 2014.
Collins, M., Koehn, P., and Kučerová, I. (2005). Clause restructuring for statistical machine translation. In Annual Meeting on Association for Computational Linguistics.
Conneau, A., Khandelwal, K., Goyal, N., Chaudhary, V., Wenzek, G., Guzmán, F