This paper explores how linguistic data annotation can be made (semi-)automatic by means of machine learning. More specifically, we focus on the use of “contextualized word embeddings” (i.e. vectorized representations of the meaning of word tokens based on the sentential context in which they appear) extracted by large language models (LLMs). In three example case studies, we assess how the contextualized embeddings generated by LLMs can be combined with different machine learning approaches to serve as a flexible, adaptable semi-automated data annotation tool for corpus linguists. Subsequently, to evaluate which approach is most reliable across the different case studies, we use a Bayesian framework for model comparison, which estimates the probability that the performance of a given classification approach is stronger than that of an alternative approach. Our results indicate that combining contextualized word embeddings with metric fine-tuning yields highly accurate automatic annotations.
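A minimal sketch of the general workflow described here, assuming the Hugging Face transformers library with bert-base-uncased as a stand-in for the LLM; the sentences, target words, and sense labels are hypothetical placeholders, and the logistic-regression classifier merely stands in for the machine learning approaches compared in the paper (the metric fine-tuning step is not shown):

```python
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")
model.eval()

def token_embedding(sentence: str, target: str) -> torch.Tensor:
    """Return the contextualized vector of the first subword of `target`."""
    enc = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]  # (seq_len, 768)
    target_id = tokenizer(target, add_special_tokens=False)["input_ids"][0]
    position = enc["input_ids"][0].tolist().index(target_id)
    return hidden[position]

# Hypothetical labeled tokens: (sentence, target word, sense label).
data = [
    ("The plane flew over the city.", "over", "above"),
    ("The lecture is over.", "over", "completion"),
    ("The lamp hangs over the table.", "over", "above"),
    ("The storm is finally over.", "over", "completion"),
]
X = torch.stack([token_embedding(s, w) for s, w, _ in data]).numpy()
y = [label for _, _, label in data]

# Any scikit-learn classifier can consume the embeddings as features.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict(X[:1]))
```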
The term ‘meaning’, as it is presently employed in Linguistics, is a polysemous concept, covering a broad range of operational definitions. Focusing on two of these definitions, meaning as ‘concept’ and meaning as ‘context’ (also known as ‘distributional semantics’), this paper explores to what extent these operational definitions lead to converging conclusions regarding the number and nature of distinct senses a polysemous form covers. More specifically, it investigates whether the sense network that emerges from the principled polysemy model of over as proposed by Tyler & Evans (2003; 2001) can be reconstructed by the neural language model BERT. The study assesses whether the contextual information encoded in BERT embeddings can be employed to successfully (i) recognize the abstract sense categories and (ii) replicate the relative distances between the senses of over proposed in the principled polysemy model. The results suggest that, while there is partial convergence, the two models ultimately lead to different global abstractions because the imagistic information that plays a key role in conceptual approaches to prepositional meaning may not be encoded in contextualized word embeddings.
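As one possible operationalization of step (ii), the relative distances between senses can be approximated by averaging the BERT token embeddings of each annotated sense into a centroid and comparing centroids pairwise. The sketch below assumes numpy and scipy; the sense labels and random arrays are placeholders for actual annotated embeddings, not the paper's data:

```python
import numpy as np
from scipy.spatial.distance import cosine

rng = np.random.default_rng(0)

# Placeholder: (n_tokens, dim) arrays of BERT embeddings per annotated sense.
embeddings_by_sense = {
    "above":      rng.random((50, 768)),
    "covering":   rng.random((40, 768)),
    "completion": rng.random((30, 768)),
}

# One centroid per sense; pairwise cosine distances approximate the
# relative distances in the sense network.
centroids = {s: e.mean(axis=0) for s, e in embeddings_by_sense.items()}
senses = list(centroids)
for i, a in enumerate(senses):
    for b in senses[i + 1:]:
        d = cosine(centroids[a], centroids[b])
        print(f"{a} vs. {b}: cosine distance = {d:.3f}")
```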
This squib briefly explores how contextualized embeddings – a type of compressed token-based semantic vector – can be used as semantic retrieval and annotation tools for corpus-based research into constructions. Focusing on embeddings created by the Bidirectional Encoder Representations from Transformers model, also known as ‘BERT’, this squib demonstrates how contextualized embeddings can help counter two types of retrieval inefficiency scenarios that may arise with purely form-based corpus queries. In the first scenario, the formal query yields a large number of hits, which contain a reasonable number of relevant examples that can be labeled and used as input for a sense disambiguation classifier. In the second scenario, the contextualized embeddings of exemplary tokens are used to retrieve more relevant examples from a large, unlabeled dataset. As a case study, this squib focuses on the INTO-INTEREST construction (e.g. I’m so into you).
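A sketch of how the second scenario could work in practice, under the assumption that BERT embeddings have already been extracted for a handful of hand-labeled exemplars and for a pool of unlabeled candidate hits; the arrays below are random placeholders for those embeddings:

```python
import numpy as np

def rank_by_similarity(prototype: np.ndarray, pool: np.ndarray) -> np.ndarray:
    """Indices of `pool` rows sorted by descending cosine similarity."""
    q = prototype / np.linalg.norm(prototype)
    p = pool / np.linalg.norm(pool, axis=1, keepdims=True)
    return np.argsort(p @ q)[::-1]

rng = np.random.default_rng(0)
exemplars = rng.random((5, 768))        # embeddings of labeled INTO-INTEREST tokens
candidates = rng.random((10_000, 768))  # embeddings of unlabeled corpus hits

prototype = exemplars.mean(axis=0)      # centroid of the exemplary tokens
top_100 = rank_by_similarity(prototype, candidates)[:100]
```

Ranking the unlabeled hits this way lets the analyst inspect and label the most promising candidates first rather than wading through the raw query output.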
The aim of this short paper is to extend the application of embedding-based methodologies beyond the realm of lexical semantic change. It focuses on the use of unsupervised BERT embeddings and uncertainty measures (Classification Entropy), and assesses whether (and how) they can be used to (semi-)automatically flag possible functional-semantic changes in the use of the construction [BE about] in the Corpus of Historical American English (COHA).
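To make the uncertainty measure concrete: Classification Entropy is commonly computed over a soft membership matrix (e.g. from fuzzy clustering of the embeddings) by averaging per-token entropy, with high values flagging tokens or periods whose cluster assignment is uncertain. The sketch below is illustrative only, with a hypothetical membership matrix rather than COHA data:

```python
import numpy as np

def classification_entropy(U: np.ndarray) -> float:
    """Mean entropy of a soft membership matrix U (tokens x clusters).

    Rows of U sum to 1; higher values indicate more uncertain assignments.
    """
    eps = 1e-12  # guard against log(0)
    return float(-(U * np.log(U + eps)).sum(axis=1).mean())

# Hypothetical memberships of four [BE about] tokens in three function clusters.
U = np.array([
    [0.90, 0.05, 0.05],  # confidently assigned token
    [0.34, 0.33, 0.33],  # maximally uncertain token: a change candidate
    [0.70, 0.20, 0.10],
    [0.10, 0.10, 0.80],
])
print(f"Classification entropy: {classification_entropy(U):.3f}")
```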