Error Detection and Correction in Annotated Corpora

Author: Markus Dickinson
Publisher:
Total Pages:
Release: 2005
Genre: Computational linguistics
ISBN:


Abstract: Building on work showing the harmfulness of annotation errors for both the training and evaluation of natural language processing technologies, this thesis develops a method for detecting and correcting errors in corpora with linguistic annotation. The so-called variation n-gram method relies on the recurrence of identical strings with varying annotation to find erroneous mark-up. We show that the method is applicable for varying complexities of annotation. The method is most readily applied to positional annotation, such as part-of-speech annotation, but can be extended to structural annotation, both for tree structures---as with syntactic annotation---and for graph structures---as with syntactic annotation allowing discontinuous constituents, or crossing branches. Furthermore, we demonstrate that the notion of variation for detecting errors is a powerful one, by searching for grammar rules in a treebank which have the same daughters but different mothers. We also show that such errors impact the effectiveness of a grammar induction algorithm and subsequent parsing. After detecting errors in the different corpora, we turn to correcting such errors, through the use of more general classification techniques. Our results indicate that the particular classification algorithm is less important than understanding the nature of the errors and altering the classifiers to deal with these errors. With such alterations, we can automatically correct errors with 85% accuracy. By sorting the errors, we can relegate over 20% of them into an automatically correctable class and speed up the re-annotation process by effectively categorizing the others.
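The core idea in the abstract, identical strings recurring with varying annotation, can be illustrated with a minimal sketch. This is not the thesis's implementation: the function name `variation_ngrams`, the toy corpus, and the simplification of comparing whole tag sequences per word n-gram are all assumptions made for illustration.

```python
from collections import defaultdict

def variation_ngrams(tagged_sents, n=3):
    """Return word n-grams that occur with more than one tag sequence.

    tagged_sents: list of sentences, each a list of (word, tag) pairs.
    Hypothetical helper: a simplified illustration, not the published method.
    """
    seen = defaultdict(set)
    for sent in tagged_sents:
        for i in range(len(sent) - n + 1):
            window = sent[i:i + n]
            words = tuple(w for w, _ in window)
            tags = tuple(t for _, t in window)
            seen[words].add(tags)
    # Keep only n-grams whose annotation varies across occurrences.
    return {words: tags for words, tags in seen.items() if len(tags) > 1}

# Toy corpus (invented data): "can" is tagged NN once and MD once
# in the identical context "the can is".
corpus = [
    [("the", "DT"), ("can", "NN"), ("is", "VBZ"), ("empty", "JJ")],
    [("the", "DT"), ("can", "MD"), ("is", "VBZ"), ("full", "JJ")],
    [("we", "PRP"), ("can", "MD"), ("go", "VB")],
]

flagged = variation_ngrams(corpus, n=3)
for words, tags in flagged.items():
    print(" ".join(words), sorted(tags))
```

With `n=1` the same function flags every ambiguous word, including genuine ambiguity; longer identical contexts make it more plausible that the variation is an annotation error, so flagged items are candidates for review rather than confirmed errors.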

Automatic Treatment and Analysis of Learner Corpus Data

Author: Ana Díaz-Negrillo
Publisher: John Benjamins Publishing Company
Total Pages: 322
Release: 2013-12-15
Genre: Language Arts & Disciplines
ISBN: 9027270953


This book is a critical appraisal of recent developments in corpus linguistics for the analysis of written and spoken learner data. The twelve papers cover an introductory critical appraisal of learner corpus data compilation and development (section 1); issues in data compilation, annotation and exchangeability (section 2); automatic approaches to data identification and analysis (section 3); and analysis of learner corpus data in the light of recent models of data analysis and interpretation, especially recent automatic approaches for the identification of learner language features (section 4). This collection is aimed at students and researchers of corpus linguistics, second language acquisition studies and quantitative linguistics. It will significantly advance learner corpus research in terms of methodological innovation and will fill in an important gap in the development of multidisciplinary approaches (for learner corpus studies).

Computational Methods for Corpus Annotation and Analysis

Author: Xiaofei Lu
Publisher: Springer
Total Pages: 192
Release: 2014-07-08
Genre: Language Arts & Disciplines
ISBN: 9401786453


In the past few decades the use of increasingly large text corpora has grown rapidly in language and linguistics research. This was enabled by remarkable strides in natural language processing (NLP) technology, technology that enables computers to automatically and efficiently process, annotate and analyze large amounts of spoken and written text in linguistically and/or pragmatically meaningful ways. It has become more desirable than ever before for language and linguistics researchers who use corpora in their research to gain an adequate understanding of the relevant NLP technology to take full advantage of its capabilities. This volume provides language and linguistics researchers with an accessible introduction to the state-of-the-art NLP technology that facilitates automatic annotation and analysis of large text corpora at both shallow and deep linguistic levels. The book covers a wide range of computational tools for lexical, syntactic, semantic, pragmatic and discourse analysis, together with detailed instructions on how to obtain, install and use each tool in different operating systems and platforms. The book illustrates how NLP technology has been applied in recent corpus-based language studies and suggests effective ways to better integrate such technology in future corpus linguistics research. This book provides language and linguistics researchers with a valuable reference for corpus annotation and analysis.

Handbook of Linguistic Annotation

Author: Nancy Ide
Publisher: Springer
Total Pages: 1440
Release: 2017-06-16
Genre: Language Arts & Disciplines
ISBN: 9402408819


This handbook offers a thorough treatment of the science of linguistic annotation. Leaders in the field guide the reader through the process of modeling, creating an annotation language, building a corpus and evaluating it for correctness. Essential reading for both computer scientists and linguistic researchers. Linguistic annotation is an increasingly important activity in the field of computational linguistics because of its critical role in the development of language models for natural language processing applications. Part one of this book covers all phases of the linguistic annotation process, from annotation scheme design and choice of representation format through both the manual and automatic annotation process, evaluation, and iterative improvement of annotation accuracy. The second part of the book includes case studies of annotation projects across the spectrum of linguistic annotation types, including morpho-syntactic tagging, syntactic analyses, a range of semantic analyses (semantic roles, named entities, sentiment and opinion), time and event and spatial analyses, and discourse level analyses including discourse structure, co-reference, etc. Each case study addresses the various phases and processes discussed in the chapters of part one.

Multilayer Corpus Studies

Author: Amir Zeldes
Publisher: Routledge
Total Pages: 266
Release: 2018-07-11
Genre: Language Arts & Disciplines
ISBN: 1351622137


This volume explores the opportunities afforded by the construction and evaluation of multilayer corpora, an emerging methodology within corpus linguistics that brings about multiple independent parallel analyses of the same linguistic phenomena, and how the interplay of these concurrent analyses can help to push the field into new frontiers. The first part of the book surveys the theoretical and methodological underpinnings of multilayer corpus work, including an exploration of various technical and data collection issues. The second part builds on the groundwork of the first half to show multilayer corpora applied to different subfields of linguistic study, including information structure research, referentiality, discourse models, and functional theories of discourse analysis, synthesizing these different discussions in a detailed case study of non-standard language in its concluding chapter. Advancing the multilayer corpus linguistic research paradigm into new and different directions, this volume is an indispensable resource for graduate students and researchers in corpus linguistics, syntax, semantics, construction studies, and cognitive grammar.

Corpus Annotation

Author: Roger Garside
Publisher: Routledge
Total Pages: 304
Release: 1997
Genre: Computers
ISBN:


This is a text which surveys the growing field of research known as corpus annotation - an electronic collection of texts. Corpus annotation is a central resource in linguistics, information technology and the processing of human language. The book seeks to show the nature of language and the most effective means of analysing it. A bibliography lists relevant e-mail addresses and Web sites.

Errors and Disfluencies in Spoken Corpora

Author: Gaëtanelle Gilquin
Publisher: John Benjamins Publishing
Total Pages: 180
Release: 2013-05-29
Genre: Language Arts & Disciplines
ISBN: 9027271798


The papers brought together in this volume illustrate how spoken corpora (be they native or learner corpora) can provide insights into various aspects of errors and disfluencies such as pauses and discourse markers. They show, among other things, that such phenomena can be influenced by factors like gender, age or genre, and that they can correlate with, e.g., informativeness and syntactic complexity. Crucially, they also demonstrate that items which are often dismissed as mere disfluencies can fulfil important functions and thus play an essential role in the management of spoken discourse. The book should appeal to linguists who are interested in spoken language in general and in errors and disfluencies in speech in particular, as well as to specialists in second language acquisition and language testing who want to know more about the nature of fluency and accuracy. Originally published in International Journal of Corpus Linguistics 16:2 (2011).

Artificial Neural Networks - ICANN 2001

Author: Georg Dorffner
Publisher: Springer
Total Pages: 1248
Release: 2003-05-15
Genre: Computers
ISBN: 3540446680


This book is based on the papers presented at the International Conference on Artificial Neural Networks, ICANN 2001, from August 21–25, 2001 at the Vienna University of Technology, Austria. The conference is organized by the Austrian Research Institute for Artificial Intelligence in cooperation with the Pattern Recognition and Image Processing Group and the Center for Computational Intelligence at the Vienna University of Technology. The ICANN conferences were initiated in 1991 and have become the major European meeting in the field of neural networks. From about 300 submitted papers, the program committee selected 171 for publication. Each paper has been reviewed by three program committee members/reviewers. We would like to thank all the members of the program committee and the reviewers for their great effort in the reviewing process and helping us to set up a scientific program of high quality. In addition, we have invited eight speakers; three of their papers are also included in the proceedings. We would like to thank the European Neural Network Society (ENNS) for their support. We acknowledge the financial support of Austrian Airlines, Austrian Science Foundation (FWF) under the contract SFB 010, Austrian Society for Artificial Intelligence (ÖGAI), Bank Austria, and the Vienna Convention Bureau. We would like to express our sincere thanks to A. Flexer, W. Horn, K. Hraby, F. Leisch, C. Schittenkopf, and A. Weingessel. The conference and the proceedings would not have been possible without their enormous contribution.

Computational Linguistics and Intelligent Text Processing

Author: Alexander Gelbukh
Publisher: Springer
Total Pages: 598
Release: 2013-03-12
Genre: Computers
ISBN: 3642372473


This two-volume set, consisting of LNCS 7816 and LNCS 7817, constitutes the thoroughly refereed proceedings of the 13th International Conference on Computational Linguistics and Intelligent Text Processing, CICLing 2013, held on Samos, Greece, in March 2013. The total of 91 contributions presented was carefully reviewed and selected for inclusion in the proceedings. The papers are organized in topical sections named: general techniques; lexical resources; morphology and tokenization; syntax and named entity recognition; word sense disambiguation and coreference resolution; semantics and discourse; sentiment, polarity, subjectivity, and opinion; machine translation and multilingualism; text mining, information extraction, and information retrieval; text summarization; stylometry and text simplification; and applications.