Corpus Query Languages
Typing your own queries gives you greater control over what you search for!
The three major online corpus architectures in use today (Corpus Workbench, BNCweb/CQPweb, Word Sketch Engine) mainly rely on two query syntaxes.
Corpus Query Language
(ref. Hoffmann et al. (2008))
Query languages are used to perform linguistic searches in a corpus. This is particularly important when we want to extract lexico-grammatical patterns.
There are two formats that can be used to enter queries in BNCweb: the Simple Query Syntax (SQS) and the Corpus Query Syntax (CQP). CQP is more verbose than SQS.
Simple Query Syntax
Easy to search for a particular word form or phrase in the entire corpus.
Be aware of the meta-characters (wildcards), which have a special function in the query language (see overview below) and must be escaped with a preceding backslash (\) if they are used literally. E.g., \? will match a literal question mark (?).
Note: words and punctuation symbols are treated as separate tokens; searches are case-insensitive by default.
Contracted forms are split.
To find patterns, use wildcard expressions.
First type (Simple query wildcards)
In SQS, a wildcard is a punctuation symbol with a special function.
You can also combine multiple wildcards, such as *oo+oo* to find 'Voodoo', 'schoolroom', etc.
Wildcards can be used freely among the items of a phrase query, but they only apply to single word tokens and do not match across multiple tokens.
Hence, black*white matches 'black-and-white', but not 'black and white'.
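The behaviour of character-level wildcards can be imitated outside BNCweb. The following is a minimal Python sketch, not BNCweb's own implementation: it assumes that ? stands for exactly one character, * for zero or more characters and + for one or more characters, and that a pattern must cover a whole token. The function names are made up for this illustration.

```python
import re

def sqs_wildcard_to_regex(pattern):
    """Translate a single-token wildcard pattern into a compiled Python regex.

    Assumed semantics (illustration only): ? = one character,
    * = zero or more characters, + = one or more characters,
    backslash = take the next character literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "\\" and i + 1 < len(pattern):
            # Escaped meta-character: match it literally.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "?":
            parts.append(".")
        elif ch == "*":
            parts.append(".*")
        elif ch == "+":
            parts.append(".+")
        else:
            parts.append(re.escape(ch))
        i += 1
    # Simple queries are case-insensitive by default.
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern, token):
    """True if the pattern covers the whole token."""
    return sqs_wildcard_to_regex(pattern).fullmatch(token) is not None

print(matches("*oo+oo*", "Voodoo"))
print(matches("black*white", "black-and-white"))
# 'black and white' is three tokens, and a wildcard never spans tokens:
print(any(matches("black*white", t) for t in "black and white".split()))
```

Using fullmatch rather than search mirrors the fact that a query item has to describe the complete token, not just part of it.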
Second type: Separate any number of alternatives with commas, and enclose them in square brackets. This makes it possible, for example, to find both the British and the American spelling of a word with a single query. E.g., ??+[able,ability] will match 'capable', 'capability', 'availability', etc.
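To see why ??+[able,ability] needs at least three characters before the suffix, the bracketed alternatives can be rewritten as a regular-expression group. This is a rough Python sketch under the same assumed wildcard semantics (? one character, * zero or more, + one or more); sqs_to_regex and sqs_match are hypothetical helper names, and the spelling query at the end is my own example rather than one from these notes.

```python
import re

def sqs_to_regex(pattern):
    """Rewrite wildcards and [a,b] alternatives as a Python regex string."""
    out = []
    in_brackets = False
    for ch in pattern:
        if ch == "[":
            in_brackets = True
            out.append("(?:")          # open a group of alternatives
        elif ch == "]":
            in_brackets = False
            out.append(")")
        elif ch == "," and in_brackets:
            out.append("|")            # comma separates alternatives
        elif ch == "?":
            out.append(".")
        elif ch == "*":
            out.append(".*")
        elif ch == "+":
            out.append(".+")
        else:
            out.append(re.escape(ch))
    return "".join(out)

def sqs_match(pattern, token):
    """True if the translated pattern covers the whole token, ignoring case."""
    return re.fullmatch(sqs_to_regex(pattern), token, re.IGNORECASE) is not None

for word in ["capable", "capability", "availability", "table"]:
    print(word, sqs_match("??+[able,ability]", word))

# Hypothetical spelling query (not from the original notes):
print(sqs_match("neighb[our,or]", "neighbour"), sqs_match("neighb[our,or]", "neighbor"))
```

'table' is rejected because only one character precedes -able, whereas ??+ demands three or more.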
Matching POS
Search for a word form with a specific POS tag by linking them with an underscore (_). Wildcards can be applied to both the word form and the POS tag.
POS can be searched alone.
Search for simplified POS tags with curly braces.
Headword and lemma queries
A HEADWORD is a set of wordforms consisting of a basic uninflected form and its inflectional variants. E.g., the headword WRITE represents the wordforms write, writes, wrote, writing and written.
Headwords do not distinguish between different word classes. Thus the headword PLAY covers both the wordforms of the verb (i.e., play, plays, played) and of the noun (i.e., play and plays).
In BNCweb, a lemma is defined as the combination of the HEADWORD and the SIMPLIFIED TAG for a given word. So the lemma
play_V
represents all the wordforms of the verb 'to play'. To search, use
Querying Word Sequences
Queries can consist of multiple words, e.g. 'talk of the town'.
All tokens (i.e., words and punctuation symbols) are separated by blanks; again possessives (Peter's) and contracted forms (they've, gonna) must be split:
Each query item in a sequence can make full use of wildcards, part-of-speech constraints, and headword or lemma searches:
Use + to skip an arbitrary token, or * for an optional token. Combine + and * for larger gaps, e.g. +++** to skip between 3 and 5 tokens.
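The arithmetic behind such gaps is simple: every + adds one obligatory token and every * adds one optional token. A small Python sketch (the helper names are invented for this illustration) computes the range and prints a CQP-style repetition over the empty token expression [] that covers the same number of tokens:

```python
def gap_range(gap):
    """Return (minimum, maximum) number of tokens skipped by a gap of + and *."""
    symbols = gap.replace(" ", "")
    if not symbols or set(symbols) - {"+", "*"}:
        raise ValueError("a gap may only contain '+' and '*'")
    required = symbols.count("+")   # each + must match exactly one token
    optional = symbols.count("*")   # each * may match one token or none
    return required, required + optional

def gap_to_cqp(gap):
    """Express the same gap as a repetition of the empty token expression []."""
    low, high = gap_range(gap)
    return "[]{%d,%d}" % (low, high)

print(gap_range("+++**"))
print(gap_to_cqp("+++**"))
```

For +++** this gives the range (3, 5), which corresponds to []{3,5} in CQP notation.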
Advanced lexico-grammatical patterns
Use regular expression notation (see below) for alternatives, optional elements and repetition within a sequence:
Regular expression notation can be nested to match complex patterns:
will find 'the biggest men', 'the most attractive man', etc.
Complex syntactic patterns can be formed, e.g. for a prepositional phrase:
will find "a preposition; followed by an optional article; followed by any number of adjectives (zero or more), each of which may optionally be preceded by an adverb; followed by a noun"
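The logic of such a token-level pattern can be tried out on a small hand-tagged sentence. The Python sketch below is only an approximation of what the corpus engine does: it writes the tag sequence out as a string and runs an ordinary regex over it. The tag labels (PREP, ART, ADV, ADJ, SUBST, ...) and the example sentence are assumptions chosen for illustration.

```python
import re

# Preposition, optional article, any number of adjectives (each optionally
# preceded by an adverb), then a noun. Every tag is followed by one space.
PP_PATTERN = re.compile(r"(?<!\S)PREP (?:ART )?(?:(?:ADV )?ADJ )*SUBST ")

def find_prepositional_phrases(tagged_tokens):
    """Return the word sequences whose tag sequence fits PP_PATTERN.

    tagged_tokens is a list of (word, tag) pairs.
    """
    tag_string = "".join(tag + " " for _, tag in tagged_tokens)
    phrases = []
    for m in PP_PATTERN.finditer(tag_string):
        # One space per tag, so counting spaces converts offsets to token indices.
        start = tag_string[:m.start()].count(" ")
        length = m.group().count(" ")
        words = [word for word, _ in tagged_tokens[start:start + length]]
        phrases.append(" ".join(words))
    return phrases

sentence = [
    ("She", "PRON"), ("sat", "VERB"), ("in", "PREP"), ("a", "ART"),
    ("very", "ADV"), ("old", "ADJ"), ("wooden", "ADJ"), ("chair", "SUBST"),
    ("near", "PREP"), ("windows", "SUBST"),
]
print(find_prepositional_phrases(sentence))
```

The nesting of the optional adverb inside the repeated adjective group is what allows 'very old wooden chair' and a bare 'windows' to be covered by one and the same pattern.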
XML tags
Proximity queries
Special syntax for searching one item within a specified range of another:
Only the left element ("target") will be highlighted on the result page. The right element is considered as a "constraint" that must be satisfied.
Multiple constraints can be chained:
In this case, day must co-occur with month as well as year in a 5-token window; only day will be highlighted on the Query result page.
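The target/constraint logic can be spelled out in a few lines of Python. This is a sketch of the idea only: the function name is invented, the window is assumed to extend the given number of tokens to either side of the target, and plain lower-cased word forms stand in for real corpus tokens.

```python
def proximity_hits(tokens, target, constraints, window):
    """Return the positions of target that have every constraint word nearby.

    Only the target positions are returned, just as only the target
    is highlighted on the result page.
    """
    tokens = [t.lower() for t in tokens]
    target = target.lower()
    hits = []
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        context = tokens[lo:i] + tokens[i + 1:hi]   # window without the target
        if all(c.lower() in context for c in constraints):
            hits.append(i)
    return hits

text = "the day , month and year are required".split()
print(proximity_hits(text, "day", ["month", "year"], 5))
print(proximity_hits("one day the whole month passed".split(), "day", ["month", "year"], 5))
```

Because all constraints are checked against the same target position, chaining constraints narrows the result set; it never adds new hits.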
Proximity queries can be nested with parentheses:
The verb waste must co-occur with time as well as money in the same sentence, but time and money must be closer together (within a 3-token window). Again, only instances of waste will be highlighted.
Proximity queries cannot be combined with lexico-grammatical patterns!
Exercise (using BNCWeb)
To boldly split. Traditional prescriptive grammars advise against the use of split infinitives such as the famous to boldly go. Use BNCweb to find out how far actual usage in Present-day English conforms to this prescription.
Write a query that matches split infinitives, consulting the BNC tagset to find the appropriate POS tags. How many split infinitives can you find in the BNC?
Compare this result to the number of prescriptively correct infinitives (boldly to go or to go boldly). Why can't you just search for the pattern to <infinitive> as a point of comparison?
Are split infinitives used more often in spoken than in written English?
Can you extend your queries to also find (split) infinitives with complex adverbs, such as to at least consider and to sort of say?
Corpus Query Syntax
The Corpus Query Syntax (known as CQP Query Syntax) was developed at the IMS, University of Stuttgart, in the early 1990s. The CQP variant used in Word Sketch Engine is an extension of the original language.
A more powerful query syntax makes it possible to ask deeper research questions!
Instead of wildcards, CQP makes use of Regular Expressions to search for
generalizations (e.g., 'all words that begin with super-')
patterns (e.g., 'all words that fit the pattern imp_ss_ble')
varieties of lexico-grammatical patterns.
Regular Expressions
Regex is a compact notation for describing repetition, optionality and alternatives in sequences of characters (word forms and annotations) or sequences of tokens (e.g., lexico-grammatical patterns). It is widely used in computational linguistics for searching and analyzing textual data. It is harder to learn but more powerful than Simple Query Syntax.
finds the nominalizations ending in -ness, -ity, -ment and -tion as well as their plural forms -nesses, -ities, -ments and -tions.
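A pattern of this kind can be tested on a word list before it is run on a corpus. The regex in the Python sketch below is my own guess at how such a query could be written, not necessarily the one used in these notes; in CQP it would be wrapped in a token expression on the word attribute.

```python
import re

# Hypothetical pattern: any stem followed by -ness(es), -ity/-ities,
# -ment(s) or -tion(s).
NOMINALIZATION = re.compile(r".+(ness(es)?|it(y|ies)|(ment|tion)s?)")

def looks_like_nominalization(word):
    """True if the whole word fits the suffix pattern."""
    return NOMINALIZATION.fullmatch(word) is not None

for word in ["happiness", "weaknesses", "abilities", "governments", "nations", "running", "city"]:
    print(word, looks_like_nominalization(word))
```

Note that 'city' is matched as well: a purely orthographic pattern cannot tell a real nominalization from a word that merely ends in the same letters, which is why such results still need manual checking.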
Note!
Regexes in CQP are case-sensitive by default; add the "ignore case" modifier %c so that "s.ng"%c also matches Song.
Regex in CQP is used both at the level of characters and at the level of tokens. So super.+ matches a word beginning with super- followed by one or more arbitrary characters, while [pos = "AT0"]? [pos = "AJ."]+ [pos = "NN."] matches a sequence of an optional article, one or more adjectives, and a common noun.
SQS allows regex notation only at the level of tokens, while simpler wildcards are used at the level of characters, e.g., super+ for words beginning with "super".
There are usually many equally valid ways of looking for the same information. E.g., the following two regexes, which search for words with more than three vowels in a row, are equivalent:
There are different dialects of regular expression syntax. CQP implements a version known as POSIX regular expressions.
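These points can be checked quickly in Python, whose re module is a different regex dialect but behaves the same way for the simple constructs used here. The sketch assumes that a CQP pattern has to cover the whole token (hence fullmatch) and imitates %c with the IGNORECASE flag; the two vowel patterns are my own illustrative pair, not the ones from the original notes.

```python
import re

def cqp_like_match(regex, token, ignore_case=False):
    """Whole-token regex match; ignore_case plays the role of %c."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.fullmatch(regex, token, flags) is not None

print(cqp_like_match("s.ng", "Song"))                    # case-sensitive
print(cqp_like_match("s.ng", "Song", ignore_case=True))  # like "s.ng"%c
print(cqp_like_match("super.+", "superman"), cqp_like_match("super.+", "super"))

# Two ways of asking for at least four vowels in a row:
WITH_COUNTER = r".*[aeiou]{4,}.*"
SPELLED_OUT = r".*[aeiou][aeiou][aeiou][aeiou]+.*"

for word in ["queueing", "onomatopoeia", "beautiful", "strength"]:
    print(word, cqp_like_match(WITH_COUNTER, word), cqp_like_match(SPELLED_OUT, word))
```

Both vowel patterns accept and reject exactly the same words; the version with the {4,} counter is simply shorter to write and easier to adjust.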
Exercise
a) Write a regex query to find words that follow the orthographic pattern VCCVCCVCCVCC..., i.e., at least four repetitions of a group that is formed by a vowel followed by exactly two consonants. Use character classes to match the consonants and vowels.
Comparisons
Matching arbitrary substrings (with comparison with SQS)
Repetition operators for multi-character substrings:
An example that shows why CQP is more powerful
Attributes
Each token in the BNC is annotated with a POS tag, headword and various other linguistic information, referred to as ATTRIBUTES of a token.
Token attributes in BNCweb
Unlike SQS, a corpus query in CQP allows you to access all these attributes in a consistent way.
Such a combination of attribute name and regex is called a constraint, and the complete expression in square brackets, which specifies one or more constraints on a single token, is referred to as a token expression.
A token expression consists of an attribute name on the left, an operator (such as =), and on the right, in quotation marks, a pattern (or regex) that the attribute value has to match. The entire expression is enclosed in square brackets [...], indicating that it refers to a single token. Conditions can be combined by using a Boolean operator.
finds the word can (case-insensitive) tagged as anything but a modal verb (i.e., as a noun or lexical verb)
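How constraints combine within one token expression can be modelled in Python with tokens stored as small dictionaries of attributes. This is only an illustrative sketch: the helper names are invented, the tagged sentence is made up, and the CQP query in the comment is one plausible way of writing such a search rather than a quotation from these notes.

```python
import re

def constraint(attr, regex, negate=False, ignore_case=False):
    """Build a test on one attribute; negate=True corresponds to != ."""
    compiled = re.compile(regex, re.IGNORECASE if ignore_case else 0)

    def check(token):
        matched = compiled.fullmatch(token[attr]) is not None
        return matched != negate
    return check

def all_of(*constraints):
    """Boolean AND over several constraints on the same token."""
    return lambda token: all(c(token) for c in constraints)

# Roughly comparable to a CQP query such as: [word = "can"%c & pos != "VM0"]
non_modal_can = all_of(
    constraint("word", "can", ignore_case=True),
    constraint("pos", "VM0", negate=True),
)

tokens = [
    {"word": "Can", "pos": "VM0"}, {"word": "you", "pos": "PNP"},
    {"word": "open", "pos": "VVI"}, {"word": "the", "pos": "AT0"},
    {"word": "can", "pos": "NN1"}, {"word": "?", "pos": "PUN"},
]
print([i for i, tok in enumerate(tokens) if non_modal_can(tok)])
```

The sentence-initial modal 'Can' passes the word constraint but fails the POS constraint, so only the noun at position 4 is reported.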
Lexico-grammatical patterns and text structure
Make use of the automatic translation of SQS into CQP in the QUERY HISTORY function!
Advanced features of CQP queries
CQP in Word Sketch Engine
A query consists of a regular expression over attribute expressions and/or structures. The attributes can be words and tags.
Practice: COPENS and WSE.
The BNC tagset can be found here. Also note that the POS tags in the corpus have been assigned by an automatic tagger and are not always correct.
The IMS Corpus Workbench and its Corpus Query Processor can be found at http://cwb.sourceforge.net
In computer terminology, a character is a unit of information that roughly corresponds to a grapheme, grapheme-like unit, or symbol. Examples of characters include letters, numerical digits, common punctuation marks (such as "." or "-"), and whitespace.