A dictionary is a description of the vocabulary of a language. It explains what words mean, and shows how they work together to create meanings and form sentences. But where do lexicographers – the people who write dictionaries – get their information from?
There are two main sources of information about words: introspection and observation.
Introspection means 'looking inside' your own brain and trying to remember everything you know about a word. Observation means examining real examples of language in use (in newspapers, novels, blogs, tweets, and so on), so that we can observe how people use words when they are communicating with one another.
It's obvious that a fluent speaker of a language must already know a lot about that language's vocabulary. So introspection can be a useful source of insights about what words mean and how they are used. But a dictionary has to give a complete and well-balanced account of a word's behaviour, and introspection alone can never provide enough information for this purpose. Consequently, lexicographers – since the time of Samuel Johnson in the 18th century – have preferred to base their dictionaries on observation. In Johnson's time, observing language was a laborious business: it meant reading hundreds of books and extracting good examples of words in use. But today's computer technology makes all this much easier. And it gives us access to so much good language data that we are now able to provide a really reliable account of English vocabulary.
For over 250 years, lexicographers have used citations – examples of words in use, taken from books or other sources – as a basis for describing language. This article from our BuzzWord archive, explaining the noun plogging, includes citations from three different online sources.
This kind of data is particularly useful for keeping track of changes in the language, and for spotting new words and phrases as they come into use. Our sources have now broadened to include not just books and newspapers, but language used on the internet too. So when our blog discussed the use of handbags as an adjective, most of the citations came not from 'traditional' media but from tweets and other postings on social networks.
Citations still have a useful role to play, but our main source of language data is the corpus. A corpus is a collection of thousands of different 'texts' stored on a computer or in the cloud. These texts include novels, academic books and papers, newspapers, magazines, recorded conversations and broadcast interviews, blogs, online journals and discussion groups, and much more. The point of using a corpus is that we can't observe all the English that is being used by millions (or even billions) of people all over the world, so instead we look at a representative sample of English texts. Using intelligent digital tools (see more on that below) we can find every example in the corpus of a particular word, phrase, grammatical pattern, or collocation. It is this information which forms the basis for everything we say about words in the dictionary.
Lexicographers use powerful computer programs to extract information from language corpora. The best-known type of software for analysing a corpus is called a 'concordancer'. A concordancer looks through the whole corpus and finds every instance of a particular word or phrase, then displays each one with its immediate context – the seven or eight words on either side of it. The resulting display is called a concordance. The most important thing for lexicographers is to identify recurrent patterns: in other words, any feature which occurs not just once but many times.
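To make the idea of a concordance concrete, here is a minimal sketch in Python of a keyword-in-context search. It is only an illustration, not the software lexicographers actually use: the function name concordance, the width parameter, and the three sample sentences are all invented for this example, and the tokenisation is deliberately crude.

```python
import re

def concordance(texts, keyword, width=8):
    """Return (left context, keyword, right context) for every occurrence.

    `width` is the number of words kept on each side of the keyword.
    """
    lines = []
    for text in texts:
        # Crude tokenisation: runs of letters and apostrophes count as words.
        tokens = re.findall(r"[A-Za-z']+", text)
        for i, token in enumerate(tokens):
            if token.lower() == keyword.lower():
                left = " ".join(tokens[max(0, i - width):i])
                right = " ".join(tokens[i + 1:i + 1 + width])
                lines.append((left, token, right))
    return lines

# Toy corpus, invented for illustration only.
sample_corpus = [
    "I remember locking the door before we left.",
    "Do you remember meeting her at the conference?",
    "She could not remember where she had parked.",
]

kwic_lines = concordance(sample_corpus, "remember", width=3)
for left, node, right in kwic_lines:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Lining the keyword up in a central column is what makes recurrent patterns easy to spot by eye. A real concordancer would also match inflected forms such as remembered and remembering, which this sketch does not.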
For example, for the word remember, a concordance shows grammatical patterns where it is used with a verb in the –ing form (or gerund). Examples of the same construction from a concordance would be as follows:
Reading the concordance of the verb remember, a lexicographer would easily identify several other patterns which are typical of the way remember is used. These include:
By scanning hundreds (sometimes thousands) of examples like this we gradually build up a picture of the most important facts about a word like remember. However, this is very time-consuming. When lexicographers first started using corpus data, in the 1980s, corpora were relatively small, with just 10 or 20 million words of text. Consequently, the number of examples for a particular word (like remember) would also be fairly small – so it was possible to look at them all. But today's corpora often contain billions of words of text, so it is no longer possible to look at every instance of a common word like remember.
Fortunately, intelligent tools solve this problem of 'information overload'. In addition to concordances we now look at 'Word Sketches', which provide an efficient one-page summary of all the key facts about a word and about the other words it regularly combines with.
How does a Word Sketch work? The program first collects all the instances of the word being investigated – just as a concordancer does. Then it applies a second stage of analysis. This time, the software looks at particular grammatical relationships. In the case of the noun evidence, for example, it finds all the sentences where evidence is the object of a verb, then identifies the most frequent verbs used in this pattern. These are the verbs that are listed in a column of a Word Sketch: people often talk (or write) about giving evidence, finding evidence, presenting evidence, or gathering evidence. Similarly, another column – headed 'modifiers of evidence' – is a list of the adjectives that most frequently modify this noun: we may say there is little evidence for something, or talk about compelling evidence, or scientific evidence. The Word Sketch also provides a link to a concordance showing all the sentences in which evidence appears in a particular pattern.
This tool has made lexicographers' lives easier, while at the same time supplying us with information which is more accurate and more detailed. Programs like this are now standard tools for lexicography, but the Word Sketch software was pioneered by Macmillan Education and used in producing the first edition of the Macmillan English Dictionary.
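The two-stage analysis behind a Word Sketch can be illustrated with a small Python sketch. This is a simplified picture under stated assumptions, not the actual Word Sketch software: it assumes the corpus has already been analysed into (grammatical relation, headword, collocate) triples, it ranks collocates by raw frequency only, and the function name word_sketch and all the data are invented for the example.

```python
from collections import Counter, defaultdict

# Hypothetical pre-analysed data: (grammatical relation, headword, collocate).
# A real system would derive these from a tagged and parsed corpus.
triples = [
    ("object_of", "evidence", "give"),
    ("object_of", "evidence", "find"),
    ("object_of", "evidence", "give"),
    ("object_of", "evidence", "gather"),
    ("modifier", "evidence", "scientific"),
    ("modifier", "evidence", "compelling"),
    ("modifier", "evidence", "scientific"),
    ("modifier", "importance", "crucial"),
]

def word_sketch(triples, headword, top_n=3):
    """Group collocates of `headword` into one column per grammatical relation."""
    columns = defaultdict(Counter)
    for relation, head, collocate in triples:
        if head == headword:
            columns[relation][collocate] += 1
    # Keep only the most frequent collocates in each column.
    return {rel: counts.most_common(top_n) for rel, counts in columns.items()}

sketch = word_sketch(triples, "evidence")
for relation, collocates in sketch.items():
    print(relation, collocates)
```

Each key in the result corresponds to one column of the summary, such as the verbs that take evidence as their object or the adjectives that modify it.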
Dictionaries don't just tell you what words mean; they also explain how words are used. And the corpus provides us with the evidence we need to fulfil these two functions.
Many words have more than one meaning, but it is almost always clear which meaning the speaker or writer intends. In these four sentences from the corpus, it is easy to see when the word goal is being used in its footballing meaning, or when it means an aim or objective:
We identify the 'right' meaning through the context the word appears in – and this is exactly what we do when we read or hear something. By studying words in context, we discover how many different meanings they have.
We saw how the concordance for remember tells us a lot about the grammatical patterns the verb is used in: with a gerund, a that-clause, an infinitive, and so on. Here again, the Word Sketches provide a useful shortcut by listing the most frequent 'constructions' – so we no longer need to scan thousands of examples. A Word Sketch of grammar patterns for the verb decide shows that the most frequent pattern with decide is an infinitive clause (e.g. Three months after that they decided to terminate my employment on health grounds.). The next most common pattern is with a that-clause (e.g. They decided that surrender was the only sensible option), and so on.
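As a rough illustration of how grammar patterns might be counted, the Python sketch below tallies two constructions of decide using regular expressions. This is an assumption-laden simplification: real tools work from grammatical analysis rather than surface patterns, the names PATTERNS and count_patterns are invented here, and the third sample sentence is made up (the first two are the examples quoted above).

```python
import re
from collections import Counter

# Surface patterns for two constructions of "decide" (a deliberate simplification).
PATTERNS = {
    "infinitive": re.compile(r"\bdecid(?:e|es|ed|ing)\s+to\s+\w+", re.IGNORECASE),
    "that-clause": re.compile(r"\bdecid(?:e|es|ed|ing)\s+that\b", re.IGNORECASE),
}

def count_patterns(sentences):
    """Count how many sentences contain each construction."""
    counts = Counter()
    for sentence in sentences:
        for name, pattern in PATTERNS.items():
            if pattern.search(sentence):
                counts[name] += 1
    return counts

sentences = [
    "Three months after that they decided to terminate my employment on health grounds.",
    "They decided that surrender was the only sensible option.",
    "We decided to leave early.",  # invented for illustration
]

pattern_counts = count_patterns(sentences)
print(pattern_counts.most_common())
```

Sorting the counts from most to least frequent gives the same kind of ranking that the list of constructions presents.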
The Word Sketch tool provides high-quality information about collocations, or words that have a tendency to go together. Using this tool means we can give a really comprehensive account of collocation for the first time. This is of great value to anyone for whom English is a second language, because collocation is a key to expressing your ideas in ways that sound natural and typical.
In this entry for the word importance, frequent collocations are shown in two ways:
All the words we've looked at so far (remember, decide, evidence, importance) can be used in any situation: you might use them in a conversation, read them in a newspaper, or see them in an academic journal. They are what linguists call 'unmarked' because they belong to the basic vocabulary of the language. But there are some words and expressions which are mainly found in one particular type of text: in spoken language, for example, or in newspapers or technical writing. Similarly, most English words are used all over the English-speaking world, but some belong to one particular regional variety of English, such as British English or Indian English.
Look at this sentence from the corpus:
Eatery is another word for 'restaurant' – but it is not 'unmarked'. When we look at all the examples of eatery in the corpus we find that a majority come from newspapers and magazines, and most of these newspapers and magazines are from the US. So in the dictionary, the word eatery has two 'labels': mainly American and mainly journalism. It is the evidence of the corpus which enables us to apply labels like this with confidence.
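The reasoning behind a label can be sketched in Python as follows. This is a hypothetical illustration only: it assumes each corpus example carries metadata about its region and text type, it uses an arbitrary threshold of more than half of all examples, and the names share_by and suggest_labels, like the data itself, are invented.

```python
from collections import Counter

def share_by(field, hits):
    """Return the proportion of hits for each value of a metadata field."""
    counts = Counter(hit[field] for hit in hits)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def suggest_labels(hits, fields=("region", "genre"), threshold=0.5):
    """Suggest a 'mainly X' label when one value accounts for most hits."""
    labels = []
    for field in fields:
        for value, share in share_by(field, hits).items():
            if share > threshold:  # the threshold is an assumption
                labels.append(f"mainly {value}")
    return labels

# Invented metadata for four corpus examples of "eatery".
eatery_hits = [
    {"region": "American", "genre": "journalism"},
    {"region": "American", "genre": "journalism"},
    {"region": "American", "genre": "fiction"},
    {"region": "British", "genre": "journalism"},
]

labels = suggest_labels(eatery_hits)
print(labels)
```

In practice the proportions would also need to be weighed against how much of the whole corpus comes from each region and text type, and the final decision rests with the lexicographer.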
In language, the more frequent something is, the more useful it is to learn. The words ameliorate and improve mean more or less the same thing – but improve is about 250 times more common. It is worth learning improve – its meaning, grammar, and collocations – because it is part of the 'core' vocabulary of English: you will see and hear it frequently, and you will probably need to use it quite often too. Ameliorate is not like this: if you happen to come across it (which is unlikely, because it is very rare), you can look it up in a dictionary to find out what it means, but it is not the kind of word you need to spend a lot of time learning.
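A frequency comparison of this kind is simple arithmetic, as the Python sketch below shows. The corpus size and the raw counts are invented, chosen only so that the ratio matches the figure of about 250 given above; the function name per_million is also an assumption for this example.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to occurrences per million words."""
    return count * 1_000_000 / corpus_size

# Invented figures, for illustration only.
corpus_size = 2_000_000_000
counts = {"improve": 500_000, "ameliorate": 2_000}

frequencies = {word: per_million(n, corpus_size) for word, n in counts.items()}
ratio = counts["improve"] / counts["ameliorate"]

for word, freq in frequencies.items():
    print(f"{word}: {freq} per million words")
print(f"improve is {ratio:.0f} times more common than ameliorate")
```

Expressing counts per million words makes it possible to compare figures drawn from corpora of different sizes.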
With a very large corpus, it is easy to identify not only which words are most frequent, but also which grammar patterns (like decide + infinitive) and which collocations (like crucial + importance) are most frequent too. It is these frequent words and combinations which we explain in most detail in Macmillan Dictionary, and the distinction we make between 'red' and 'black' words is one of the unique features of our dictionary.
Dictionary users appreciate example sentences. A good example sentence is one that shows how a word works in context, and helps to explain what it means. An example for a word in a dictionary should be typical of the way the word is used in real life – so we use the corpus as a source for these.
To see how the selection process works, look back at the entry for importance above. We use the Word Sketch and concordances to identify the facts about the word which are most worth including in the dictionary (and in this case, that includes various common collocations). But notice the first example: By 1800, the monarchy had declined in importance.
We chose this example because in the corpus we find many hundreds of instances of the expression in importance with a verb in front of it. This means it is one of the typical features of the way importance is used. Further research shows that the verbs which occur in this position are usually words like increase, grow, and gain, or decline, decrease, and diminish.
Among the many instances of this pattern, we find several which illustrate the sequence: By [date] X had declined in importance. One of these, for example, reads:
By the early 12th century, the monasteries, which had been the focal points of religious life, had declined in importance and the way was ready for the introduction of the diocesan system.
But this sentence is too long for the dictionary, and it contains a lot of unnecessary extra information. So we have changed the sentence a little, and shortened it to what you see in the dictionary:
By 1800, the monarchy had declined in importance.
This example is short and easy to understand – but it faithfully reflects the way importance is used in the corpus.