Module 1: Understanding technical vocabulary (including controlled vocabularies and ontologies)
Objectives: Vocabulary is the key to understanding biomedical and health research papers. In this module we will develop an understanding of what constitutes the technical vocabulary of our own specialty. We will learn basic linguistic terminology including corpus, type, token, keyness, General Service List, Academic Word List, off-list, n-gram.
Practicum: in the course of this module, students will analyze a research article of their choice, determining the word count (tokens) and number of different words (types) in the article. They will classify words (types) as being from the General Service List or the Academic Word list, and they will identify “off-list” types which are likely to be technical terms in their specialty. They will also identify commonly recurring phrases within the article which they have chosen.
Much of the technical vocabulary in biomedicine and health either is or comes from Latin or Greek. There was a time in the English-speaking world when schoolchildren studied these languages in order to better understand general English, and so English-speaking students entering the health sciences might have made an easier transition into biomedical terminology. However, Latin and Greek are no longer widely taught before university, so that today biomedical language is as much a foreign tongue to the native English speaker as it is to their Chinese or Malay counterpart.
Key point: Biomedical Language is a second language for all of us.
This may explain why much of our initial education in the health sciences is basically a language lesson. This is obvious when you examine the textbooks used in first-year biology and health sciences courses. The introductory chapters in particular are stuffed with definitions, technical terms are highlighted in the text, and most introductory texts feature exhaustive glossaries of technical terms. No comparable figures exist specifically for the health sciences, but one study found that in introductory lectures in biology and chemistry definitions occurred slightly more often than once every two minutes (1). Thus we understand that technical vocabulary acquisition is an important component of initiation into the biomedical and health community.
Not only is biomedical vocabulary important and rather cryptic, but there is rather a lot of it. To put things into perspective, literacy standards for school textbooks in North America are based on a target vocabulary of approximately 2,000 words which students are expected to acquire during primary school. A number of jurisdictions actually legislate levels of readability for particular sets of documents – especially those with legal or health significance – in order that they be understandable to someone with only a primary school education. At the other end of the spectrum, it is said that the average English-speaking university graduate probably knows about 20,000 word families, where a word family is a root word plus all of its derivatives e.g. run, runs, running, ran. By comparison, dictionaries of the English language will list from about 50,000 to about 100,000 word families. How many of these words does a person really need to learn?
Key point: The most basic level of fluency in general English requires a vocabulary of approximately 2,000 words.
In fact, a few words go a long way. The word “the” is said to account for about 7% of the written English language. Only ten such commonly occurring “functional” words make up 24% of the English language. This looks promising, but from there on the additional coverage provided by learning new words falls off dramatically. With a vocabulary of 1,000 words you can read about 72% of the words on a page of unspecialized (non-medical) text. But you have to double your effort to 2,000 words to gain only an additional 8% and take yourself up to 80% coverage. Furthermore, these are not just any 2,000 words; they are the 2,000 most common words in the language (also known as the GSL – the General Service List). If you triple your effort and learn the 3,000 most common words in the English language, you only gain an additional 4% coverage to 84% and you will still not be able to understand most sentences that you encounter. This is because it is estimated that we need a vocabulary of about 15,000 words to cover 95% of the language (Fig 1), and it is this degree of coverage that we need to read and comprehend. In general, if we know 95% of the words on a page, we can guess the meaning of the other 5%.
Key point: Our primary target for biomedical fluency is to develop a vocabulary which lets us read and understand 95% of the words that we encounter within our area of interest.
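The diminishing returns described above can be checked on any text. The following is a minimal Python sketch, not the method used to produce the figures quoted here. It counts how often each word occurs and reports what fraction of all tokens is accounted for by the n most frequent words. The function names `tokenize` and `coverage_of_top_n` and the sample sentence are our own illustrative assumptions.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (letters, optional apostrophe)."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def coverage_of_top_n(tokens, n):
    """Fraction of all tokens accounted for by the n most frequent types."""
    counts = Counter(tokens)
    covered = sum(freq for _, freq in counts.most_common(n))
    return covered / len(tokens)


# Illustrative sample only; a real analysis would use a large corpus.
sample = "the cat sat on the mat and the dog sat on the rug"
tokens = tokenize(sample)
for n in (1, 2, 3):
    print(n, round(coverage_of_top_n(tokens, n), 2))
```

On a large corpus, plotting this value against n would be expected to give a curve that rises steeply at first and then flattens, which is the pattern described in the text.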
The point about 95% coverage can be demonstrated by looking at a passage of text from which academic and technical words have been removed. Here is a passage from an article pertaining to biomechanics. All of the words which have been left in place are from the General Service List – the 2,000 most common word families in the English language, and so this gives you a good idea of how well you would understand the article if you did not know any academic or technical vocabulary. About 35% of the words are missing and the passage is virtually indecipherable!
The model for the system consists of 3 interacting . These are the system (structure of the , stiffness of the , , joint , and the properties of the ), active system ( properties of and ), and system ( and other control ). The 3 complement each other and work together to . The receives and force from both the and active systems, and this information for levels of to balance any forces.
Not very pretty, is it? In general English, 2,000 words normally give us about 80% coverage of a text. In this case, however, we have only about 65% coverage. This is because words which are relatively common in general English are being displaced from this academic article by more specialized vocabulary. We seem to be a long, long way from getting the 95% coverage that we would like.
This is a widespread problem with academic writing, and linguists have stepped in to give us some help. A number of groups have tried to identify those words which are especially common in academic environments. From these efforts, one very useful word list which has emerged is the Academic Word List – a list of about 570 word families which commonly occur in academic documents. In fact, with the 2,000 most common words, plus the 570 word families of the AWL, one can achieve close to 90% coverage of most academic texts. That is equivalent to the coverage that we would get from otherwise memorizing 6,000 words. Expressed another way, we can say that by focusing our vocabulary acquisition, we reduce our work load by more than half.
If we put back in the words from the academic word list, we get something like this:
The model for the stabilizing system consists of 3 interacting . These are the passive system (structure of the , passive stiffness of the , , joint , and the passive properties of the ), active system ( properties of and ), and system ( and other control components). The 3 complement each other and work together to achieve stability. The receives and force from both the passive and active systems, and integrates this information for appropriate levels of to balance any forces.
Well, that is better but it is still rather disappointing. The AWL only gave us an additional 10% coverage, which means we are barely up to 75% coverage of this text, not the 90% that we are supposed to have or the 95% that we need. We are still a long way from having an understandable text. This suggests that for this particular paper, we need a very specialized vocabulary - and we can see that by filling in the missing 25% of the words. This is what we get:
The Panjabi model for the spinal stabilizing system consists of 3 interacting subsystems. These are the passive system (structure of the vertebrae, passive stiffness of the intervertebral discs, spinal ligaments, joint capsules, and the passive properties of the muscles), active system (contractile properties of muscles and tendons), and neural system (proprioceptors and other neural control components). The 3 subsystems complement each other and work together to achieve stability. The neural subsystem receives positional and force feedback from both the passive and active systems, and integrates this information for appropriate levels of muscle activation to balance any destabilizing forces.
What we see is that texts related to our interest have a pattern of language usage which is quite different from general English. A large proportion of the essential words are quite specialized. In fact, more than half of the technical words in our sample passage are anatomical terms, and the remainder concern biomechanics. Without these words, the text is indecipherable. Furthermore, we are highly unlikely to encounter these words in the normal course of studying English.
Clearly then, we need to know almost all of the words in a passage of technical writing in order to understand it. If 5% or less of the words are unknown to us, we can usually infer their meaning from the context. You might think that a learner could simply look up new words in the dictionary as they encountered them, but this is actually impractical. First, the process is too time-consuming and, secondly, as a general rule, one needs about 10 exposures to a new word before it becomes internalized (in other words we are going to have to look it up 10 times before we remember it!). Furthermore, the biomedical literature uses a vocabulary which is quite different from general English so that many of the terms that we use in biomedical language are not found in general dictionaries, or general dictionaries do not explain their biomedical nuances. And so, to read fluently within our own speciality we need to have memorized the 2,000 word families of the General Service List, the 570 word families of the Academic Word List, and enough technical terms to bring coverage up to 95%. The question now becomes “How many technical words is enough?”
Key point: Each of us needs to find enough of the key words for our specialty (not for all of biomedicine and health) so that we can read and understand 95% of the words that we encounter.
The total biomedical and health vocabulary is, of course, huge and growing so that it is impossible to say how many technical words there are in all. On the other hand, there are some very well defined and useful word lists. In fact, the development of biomedicine and health may be characterized by the formalization of specialized word lists and ontologies. Early biologists were absorbed with the classification of species, beginning with the larger animals and plants, then later tackling microorganisms. Hence, we are all familiar with long lists of names of birds, flowers or bacteria. Furthermore, when species are listed, the names are not normally arranged randomly. Rather, names of related species are placed together. Thus, the position of a species name in a list or schema actually reveals something about the meaning of the name. A word list in which the placement of the word reveals something about its meaning is called an ontology (Figure 2).
Figure 2: Representative ontology of the prokaryotic genus Staphylococcus
- Staphylococcus aureus
- Staphylococcus epidermidis
- Staphylococcus saprophyticus
The partial ontology shown in Figure 2 is more than a list of names of bacteria. It suggests to us that the 3 species of bacteria listed are closely related and that they probably therefore share certain physiological characteristics. It follows from this that they share certain similarities with regard to diagnostic tests and antibiotic sensitivities. This is a lot of useful information to get from a simple list! This also demonstrates the general principle that, for someone entering a particular area of biomedicine and health, it is well worth being aware of and learning certain word lists.
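The idea that position carries meaning can be made concrete with a small data structure. The sketch below is only an illustration under our own assumptions: it stores the three species from the Staphylococcus example under their genus and looks up which other names share the same parent. The names `ontology` and `siblings` are hypothetical and do not come from any published ontology format.

```python
# A toy ontology: each parent term maps to the list of terms placed beneath it.
ontology = {
    "Staphylococcus": [
        "Staphylococcus aureus",
        "Staphylococcus epidermidis",
        "Staphylococcus saprophyticus",
    ],
}


def siblings(ontology, species):
    """Return the parent term and the other entries listed under the same parent."""
    for parent, children in ontology.items():
        if species in children:
            return parent, [c for c in children if c != species]
    return None, []


parent, related = siblings(ontology, "Staphylococcus aureus")
print(parent, related)
```

Simply knowing where a name sits tells us which other names are likely to denote organisms with similar characteristics.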
Unfortunately, for the novice microbiologist, there are approximately 2,000 genera of bacteria, and many thousands of species names. Should a microbiologist be expected to memorize all of these? A similar dilemma faces all biomedical and health researchers and clinicians, regardless of their speciality. By way of example, one ontology lists approximately 8,500 words for different types of cancers.
It is said that adult learners of English (those who learn English as adults rather than as children) seldom achieve vocabularies of more than 5,000 words, which means that most adult learners go their whole lives and never have the skill to broadly read and understand general English. This might also lead us to suspect that most people who are trying to acquire the ability to read biomedical papers in “English” will never acquire the fluency that they seek if they follow conventional methods of language acquisition. This suggests that if you want to acquire real fluency in biomedical language then you cannot simply memorize ontologies relevant to your speciality. You need a more focused approach.
Certainly it would be nice to know all of the biomedical and health terms in the world, but this is not just unreasonable, it is impossible. Therefore, we have to prioritize our language learning and identify the words and phrases which are really important to the particular area within which we work. It would be most convenient then if, apart from exhaustive ontologies, there were a list of key words for biomedicine and health, or a list of key words for each sub-domain.
Key point: word lists and ontologies abound in biomedicine and health, but there are too many words to memorize. Therefore, we have to target the most useful words to learn.
In the past, the identification of core technical words was based primarily on intuition – teachers taught what they guessed would be the most important words. Now, however, we have the technology to identify quite reliably the essential terminology within a given sub-domain of biomedicine and health. This is achieved through the methods of corpus linguistics.
A corpus is a body of language selected and ordered according to specific linguistic criteria in order to be used as a representative sample of the target language. In other words, a corpus is a collection of the literature, and it may be analyzed by computer for word content, including word frequencies and groupings of words. We can build corpora for the different biomedical and health sub-domains and, using specialized computer software, make comparisons to general English. In this way we can identify the most commonly occurring keywords in our area of interest, and learn these keywords first. So far, this has been done for the fields of nursing (2), public health (3) and midwifery, and more corpora are being developed all of the time.
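To show how a comparison with general English can be made, here is a minimal sketch of one widely used keyness statistic, the log-likelihood ratio. This is an assumption on our part: the text does not say which statistic the specialized software uses, and other measures exist. The function name `log_likelihood` and the example counts are purely illustrative.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of a word, given its frequency in a target corpus
    and in a reference corpus, and the total token counts of both corpora."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    # A zero frequency contributes nothing to the sum (0 * ln 0 is taken as 0).
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll


# Invented counts: a word seen 50 times per 1,000 tokens in the specialist
# corpus but only 5 times per 1,000 tokens in the general corpus.
score = log_likelihood(50, 1000, 5, 1000)
print(round(score, 1), score > 3.84)  # 3.84 is the usual threshold for p < 0.05
```

A word is a candidate keyword when its score exceeds the chosen threshold and its relative frequency is higher in the specialist corpus than in the reference corpus.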
The public health corpus comprises research articles, editorials, commentaries and reviews from 4 public health journals (Table 1). These journals were selected because they publish on a broad range of topics in public health, they were accessible in electronic form, and they conform to standard American (The American Journal of Public Health, Public Health Nursing) and British (BMC Public Health, The Journal of Public Health) English. The corpus covered all issues published by the 4 journals during the year 2005. Table 1 shows the breakdown of papers per journal, as well as number of tokens (the total “word count”) and types (the number of different words).
Journal (2005 impact factor)
The American Journal of Public Health (3.566)
BMC Public Health (1.658)
The Journal of Public Health (1.031)
Public Health Nursing (0.693)
Public Health sub-corpus
The words and phrases found in the public health corpus were compared to a corpus of general English. What were the results of this study? It was possible to identify over three thousand words which occurred significantly more often in the public health literature than in general English. This doesn't mean that we have to memorize 3,000 new words because this list of 3,000 includes many words from the GSL and AWL. Furthermore, this list was pared down by eliminating words which occurred less than once per 10,000 words. This would be about once every 4 or 5 articles and reduced the list to 951 words. This eliminated rare words which were statistically over-represented but of such low prevalence as to be of little importance for learning purposes. A further criterion was applied: dispersion of 30% or more (i.e. present in 30% or more of the journal issues), reducing the list to 928 words. This list was then sorted to identify 486 words falling within the most common 2000 words in the English language (the GSL); 275 words within the Academic Word List (the AWL) and 153 key words in neither and therefore likely to be particular to public health.
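The two filtering steps described above, a minimum frequency of once per 10,000 words and a dispersion of 30% or more, can be sketched as follows. This is a simplified illustration under our own assumptions rather than the procedure actually used in the study: each journal issue is represented as a list of tokens, and the function name `filter_keywords` and the toy data are hypothetical.

```python
from collections import Counter


def filter_keywords(candidates, issues, min_per_10k=1.0, min_dispersion=0.30):
    """Keep candidate words that are frequent enough and spread widely enough.

    issues: list of journal issues, each given as a list of tokens.
    """
    total_tokens = sum(len(issue) for issue in issues)
    counts = Counter(tok for issue in issues for tok in issue)
    issue_sets = [set(issue) for issue in issues]
    kept = []
    for word in candidates:
        per_10k = counts[word] * 10_000 / total_tokens
        # Dispersion: share of issues in which the word occurs at least once.
        dispersion = sum(word in s for s in issue_sets) / len(issues)
        if per_10k >= min_per_10k and dispersion >= min_dispersion:
            kept.append(word)
    return kept


# Toy data: "zoonosis" is frequent but confined to one issue out of four.
issues = [
    ["prevalence", "of", "obesity"],
    ["prevalence", "rates"],
    ["zoonosis", "zoonosis", "zoonosis"],
    ["the", "study"],
]
print(filter_keywords(["prevalence", "zoonosis"], issues))
```

The dispersion test is what removes words that are common only because a single issue happened to concentrate on one topic.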
How can we apply this to reading public health research? Well, we can probably assume that anyone trying to read such articles has already mastered the GSL and probably the AWL as well. And so, the best investment of their time would be to learn the meanings of those 153 other special terms which occur broadly and with high frequency in public health. Learning just those 153 words would give them the same benefit as learning thousands of other English words as they came up randomly in reading.
Key point: If one already knows the 2,000 GSL words and the 570 AWL words, one only needs to learn a few key technical words in order to read and understand the public health literature.
By the way, ranked in descending order, the 30 most prevalent words from the 3 lists are presented in Table 2 along with the raw frequency and dispersion throughout the corpus expressed as a percentage. The first column shows words which are neither from the AWL nor the GSL. In linguistics, such words are often referred to as “off-list”. Keywords from GSL and AWL are often familiar to learners, and even if they have special nuances within a biomedical or health speciality, their meanings can generally be inferred from context. The off-list words, however, present a special challenge to the learner. Such words are infrequently encountered in general English and so these words have to be deliberately studied.
In addition to individual words, this study extracted n-grams (phrases), where ‘n’ denotes the number of words in the unit. Starting with 7-grams (seven-word phrases), we worked down to 3-grams, extracting meaningful multi-word sequences. This was done for the 50 most frequent 7- to 3-word phrases, and the five top-ranking n-grams of each length are shown in Table 3.
Table 3. The five top-ranking n-grams of each length, from 7-grams down to 3-grams (# replaces a number)
7-grams
- the centers for disease control and prevention
- the purpose of this study was to
- between the ages of # and #
- between # and # years of age
- from # in # to # in
6-grams
- women / children aged # to # years
- men who have sex with men
- the national center for health statistics
- there were no significant differences in
- it is important to note that
5-grams
- aged # years and older
- were more likely to be
- on the basis of the
- ranged from # to #
- were more likely to have
4-grams
- at the time of
- as a result of
- in the case of
- as well as the
- on the other hand
3-grams
- the prevalence of
- in order to
- in terms of
- based on the
- the fact that
Note. freq. = frequency.
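The extraction of n-grams can be sketched in a few lines of Python. This is a minimal illustration and not the software used in the study. Replacing every number with # is our reading of the entries in Table 3, and the function names `tokenize_with_numbers`, `ngrams` and `top_ngrams` are our own assumptions.

```python
import re
from collections import Counter


def tokenize_with_numbers(text):
    """Lowercase the text, replace every number with '#', and split into tokens."""
    text = re.sub(r"\d+(?:\.\d+)?", "#", text.lower())
    return re.findall(r"#|[a-z]+(?:'[a-z]+)?", text)


def ngrams(tokens, n):
    """All contiguous sequences of n tokens, each joined into one string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_ngrams(tokens, n, k=5):
    """The k most frequent n-grams with their raw frequencies."""
    return Counter(ngrams(tokens, n)).most_common(k)


sample = "women aged 18 to 25 years and men aged 30 to 45 years"
tokens = tokenize_with_numbers(sample)
print(top_ngrams(tokens, 5, k=1))
```

Because numbers are collapsed to #, phrases such as "aged 18 to 25 years" and "aged 30 to 45 years" are counted as the same pattern.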
What we have learned from this study is that technical vocabulary plays an important role in practices such as public health. However, from a detailed computer analysis of a very large corpus of the public health literature (almost 2 million words) we were able to identify a relatively small number of key words and phrases which especially characterize journal publications in this field.
Quite a few words which are important in public health come from the GSL and AWL, but they occur so much more often in public health that they probably have a special meaning in this field. For example, looking at Table 2, “intervention”, “outcome” and “exposure” have low frequency rankings on the AWL but were highly prevalent in the public health corpus, and in fact do carry a specific technical meaning when used in public health. In other words, looking up the meanings of these words in conventional dictionaries probably will not be a great help in understanding their use in public health.
We also have to deal with those 153 non-GSL, non-AWL words which are likely to be less readily accessible and arguably of highly specific meaning in public health research. These include epidemiological terms (prevalence, mortality, baseline and regression), abbreviations (ci, tb and bmi), and names of diseases and medical conditions (cancer, diabetes and obesity).
The acquisition of technical vocabulary presents an important challenge to language learners. However, awareness of these key terms and phrases could greatly facilitate the acquisition of communicative competence in the language of public health, and presumably in other health disciplines as well. Furthermore, this would contribute to more effective dispersion and implementation of health care knowledge. Language learning is too vast and our time is too restricted to be wasted on undirected and inefficient learning strategies.
And so, we should now have some idea of how many words we need to know, how those words can be identified, and how to learn them.
Key point: We need to know the 2,000 words of the GSL, the approximately 570 words of the AWL, and probably a few hundred technical terms specific to our field. The best way to identify those key words (and phrases) is by computer analysis of large corpora of the literature.
More recently, a corpus of midwifery and perinatal care (MPC corpus) was developed and analyzed. The MPC corpus was derived from articles published in 2005 in 5 leading journals in the field.
Table 4. Composition of the Corpus of Midwifery and Perinatal Care
Journal (2005 impact factor)
J Midwifery Womens Health (0.758)
J Obstet Gynecol Neonatal Nurs (0.846)
J Perinat Neonatal Nurs (0.654)
As you can see, the MPC corpus amounted to a little over 1 million words. The frequency with which words occurred in this corpus was compared with their frequency in a corpus of general English derived from the New York Times – this was called the NYT corpus. This made it possible to identify 3,590 words which were “over-represented” in midwifery and perinatal care; that is to say these words occurred significantly more often than would be expected by comparison to general English. Filtering to remove rare words and words which were not widely dispersed across the MPC corpus reduced the list to 1,108 key words of which 335 were from among the 1,000 most common words in the English language, and 141 were from among the second 1,000 most common words. Thus, 476 of the words were from the GSL. An additional 390 words were from among the word families of the AWL. The remaining 242 core words were off-list, from neither the GSL nor the AWL.
Now we must ask ourselves how these findings can be applied to help someone who is trying to read and understand the literature of midwifery and perinatal care. We may assume that anyone reading in biomedicine and health has already mastered the GSL. This is the basis of even the most rudimentary English language programmes. Knowledge of the AWL is also widespread, and so we need to concentrate on our off-list words. How many do we need to learn in order to achieve our primary target of 95% comprehension?
Figure 2 answers this question for us. Here, the percentage coverage of the corpus is plotted against the number of words that we know. In this graph, the vertical scale begins at about 85.6%. This is the coverage that we achieve with just the GSL and AWL words. By the way, this is close to the comparable figures for general English, demonstrating that the midwifery literature does not rely heavily on esoteric terms. From figure 2 we can see that by adding 1,000 off-list words we can boost our coverage to 94.8% which is approximately our target threshold for comprehension. Furthermore, just learning the 242 highly-prevalent off-list words that we identified in this study would give 90% coverage.
Figure 2. Percentage coverage of MPC corpus provided by the top three thousand off-list keywords
What does this mean for the learner who wants to read and understand the literature of midwifery and perinatal care? It means that they can achieve functional reading literacy by learning the 2,000 GSL words, the 570 AWL words, and a mere 1,000 off-list words. Furthermore, their time would best be spent by concentrating first on 242 highly-prevalent keywords.
To put this into context, we remind you that it takes 15,000 words to achieve the same degree of functional literacy in general English, and that the average adult learner of English is lucky to acquire 5,000 words. In other words, it is much easier to use a focused learning approach and become functionally literate in a sub-specialty of biomedicine and health, than it is to achieve the same level of literacy in general English!
Key point: Compared to the requirements for literacy in general English, and the actual performance of average adult English learners, it is quite easy to achieve a high level of literacy in a sub-specialty of biomedicine and health by using a focused approach to vocabulary acquisition.
Now it is time to discover the essential vocabulary for your specialty. Begin by searching a journal within your specialty, looking for an interesting article which is a few thousand words in length. Copy the article, from the beginning of the title to the end of the discussion, into a text file. Remove any references in the text and any other extraneous material such as figure legends. You want to analyze just the text of the article. Once it is cleaned up, insert the “meta-data” at the beginning of the file to identify the source of your material.
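Once the article has been cleaned up, the word count (tokens) and the number of different words (types) can be obtained with a short script. The sketch below is one simple way to do this under our own assumptions. The function name `count_tokens_and_types` is hypothetical, hyphenated words are split into their parts, and each distinct word form counts as a separate type, so "subsystem" and "subsystems" are not grouped into one word family.

```python
import re


def count_tokens_and_types(text):
    """Return (number of tokens, number of types) for a piece of text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return len(tokens), len(set(tokens))


# In practice the text would be read from your cleaned file, for example:
#     with open("article.txt", encoding="utf-8") as f:   # hypothetical file name
#         text = f.read()
text = "The neural subsystem receives feedback. The subsystems work together."
n_tokens, n_types = count_tokens_and_types(text)
print(n_tokens, n_types)
```

Different tools apply slightly different rules for what counts as a word, so small differences from the counts reported by other software are to be expected.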
To analyze just your one article for the proportion of words (GSL vs AWL vs off-list) we can use a tool on a very versatile language learning site developed by Tom Cobb at l’Université du Québec à Montréal. To analyze your article, copy only the body of the file and paste it into the vocabulary profiler window on this page:
Press the “Submit_window” button and watch the programme analyze the article that you have chosen. The programme will show you how many words are from the GSL (K1 and K2), from the AWL and off-list. It will also show you how many times each one of these words appeared in the text. In general English texts, 80% of the words are from the GSL – usually about 70% are K1 (the first one thousand most common words) and 10% are K2 (the second thousand most common words). Then, the remaining 20% of the words are evenly divided between AWL and off-list. How does your article compare?
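The kind of classification that a vocabulary profiler performs can be illustrated with a simplified sketch. This is not the profiler's actual implementation. The two small word sets below are stand-ins chosen from the biomechanics passage earlier in this module and are nowhere near the full GSL and AWL, and matching is by exact word form rather than by word family. The function name `lexical_profile` is our own assumption.

```python
import re
from collections import Counter


def lexical_profile(text, gsl, awl):
    """Proportion of tokens found in the GSL set, the AWL set, or neither."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    profile = Counter()
    for tok in tokens:
        if tok in gsl:
            profile["GSL"] += 1
        elif tok in awl:
            profile["AWL"] += 1
        else:
            profile["off-list"] += 1
    total = len(tokens) or 1  # avoid division by zero for empty text
    return {band: profile[band] / total for band in ("GSL", "AWL", "off-list")}


# Tiny stand-in lists for illustration only (not the real GSL or AWL).
gsl_sample = {"the", "system", "and"}
awl_sample = {"passive", "achieve", "stability"}

text = "the passive system and the neural system achieve stability"
print(lexical_profile(text, gsl_sample, awl_sample))
```

With the complete word lists loaded in place of the stand-in sets, the same function would give a rough profile of a whole article.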
To see how different words are actually used in context, you can paste your text into the window on this page:
This programme is known as a concordancer and shows how each word is used in context. This is great for finding “example” sentences which help you to infer the meanings of words that you do not understand. If you have a large corpus, a concordancer can also be used to search for example sentences that you can mimic for your own writing. This is something that was previously done for the corpus of midwifery and perinatal care.
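What a concordancer does can be shown with a minimal keyword-in-context sketch. This is an illustration under our own assumptions, not the programme referred to above. The function name `concordance` and the `width` parameter (the number of words shown on each side) are hypothetical.

```python
import re


def concordance(text, keyword, width=3):
    """Return each occurrence of keyword with up to `width` words of context per side."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines


text = ("The neural subsystem receives positional and force feedback "
        "from both the passive and active systems")
for line in concordance(text, "passive", width=2):
    print(line)
```

Reading several such lines for the same keyword is often enough to infer its meaning and to see which words it typically accompanies.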
1. Flowerdew J. Definitions in science lectures. Applied Linguistics 1992;13(2):202-221.
2. Budgell B, Miyazaki M, O’Brien M, Perkins R, Tanaka Y. Developing a corpus of the nursing literature: a pilot study. Japan Journal of Nursing Science 2007;4:21-25.
3. Millar N, Budgell B. The language of public health – a corpus based analysis. The Journal of Public Health 2008;16(5):369-374.