PAN L10n Wiki : GuidelinesForPartOfSpeechClassification


Guidelines for Part of Speech Classification




1. Introduction
This document presents a stepwise scheme for constructing the part of speech tagset for a language. The templates for recording the list of parts of speech (SummaryOfPartOfSpeech.doc) and the relevant discussion and analysis of each part of speech (PartOfSpeechTemplate.doc) are attached to this document.

2. Part of speech classification
We present a hierarchical approach to facilitate part of speech description, with a three-level classification of the tagset. The first level consists of generic categories that are common to most languages; these are listed in the following section. These categories are not tags themselves; rather, they serve as the basis for the second level of classification, which is the set of tags used for corpus tagging. Every category that is relevant to the specific language should be mapped to one or more tags. The third level is based on morphological differences (inflectional rules in general), such as variant forms of a noun or verb due to gender, number, tense, etc. This level is optional and will not be considered in the initial POS tagging activity. Tags at this level can be added if they are found necessary for the language under consideration or for the application using the tagged corpus.
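The three levels can be pictured as a small tree. The sketch below is illustrative only: the class and field names are hypothetical, the second-level labels are borrowed from the Penn Treebank as examples, and the third-level labels (NN-SG, NN-PL) are invented for this sketch. A given language may arrange its tags differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tag:
    """Second-level tag, used for corpus annotation."""
    name: str
    description: str
    # Optional third-level tags for morphological variants (hypothetical names).
    child_tags: List[str] = field(default_factory=list)


@dataclass
class Category:
    """First-level generic category; a grouping, not a tag itself."""
    name: str
    tags: List[Tag] = field(default_factory=list)


# Illustrative fragment for English (an assumption, not a prescribed tagset).
noun = Category(
    name="Noun",
    tags=[
        Tag("NN", "common noun", child_tags=["NN-SG", "NN-PL"]),
        Tag("NNP", "proper noun"),
    ],
)


def annotation_tags(categories):
    """Return the second-level tag names, i.e. those used in initial tagging."""
    return [tag.name for cat in categories for tag in cat.tags]
```

Keeping the third level as an optional list on each tag reflects the fact that it can be left empty during the initial tagging activity.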

2.1. Part of Speech Categories
This section presents the part of speech categories that are common to most languages. The list will help in deciding the first-level categories for each language.

Noun
A noun is the name of a person, place, thing, or idea. Whatever exists, we assume, can be named, and that name is a noun.
Examples:
John and Maria went to school.

Verb
Verbs carry the idea of being or action in the sentence.
Examples:
The girl went to school.
The girl is playing with the doll.
He is a good boy.
I can do it.

Adjective
Adjectives are words that describe or modify nouns.
Examples:
This is a good book.
Give me the beautiful red scarf.
He is very surprised.
The word ‘surprised’ is considered an adjective (rather than a verb) here because it follows adjective rules, i.e. it occurs in an adjective position and takes the adverb ‘very’.

Adverb
Adverbs are words that modify a verb, an adjective, another adverb, or a whole clause.
Examples:
I am very happy.
She drove slowly.
She moved quite slowly.
He moved right after me.

Preposition/Postposition
A preposition/postposition describes a temporal or spatial relationship between things, events and actions in a sentence.
Examples:
The person standing after John.
My book is on the table.

Coordinate Conjunction
A conjunction that joins two phrases or clauses of equal level is called a coordinate conjunction.
Examples:
Rabia, Maria and Ayesha are going.
Rabia is brilliant and Saima has a pleasant personality.
John lost his book, but he still has his pen.
I have a cat or a dog.

Subordinate Conjunction
Conjunctions that introduce an embedded clause are called subordinate conjunctions.
Examples:
She is going because she wants to go.
She wrote while he was sleeping.
He came there after the boys left.

Pronoun
Generally, pronouns stand for (pro + noun) or refer to a noun: an individual or thing, or individuals or things (the pronoun's antecedent), whose identity is made clear earlier in the text.
Examples:
We are good at making coffee.
She asked him a question.
Who is at the door?

Determiner
Determiners are words used with nouns to restrict their meaning by limiting the reference of the noun. There are many types of determiners in English, as shown in the following examples.
Examples:
Who is at the door?
Two cars are parked in the building.
My car is parked in the building.
I bought many books.
This book is good.

Interjection
Interjections are words used to show emotion. Such words can stand alone and occur as a complete sentence.
Examples:
My, what a gorgeous day.
Wow, how beautiful it is.

Punctuation
Treating punctuation marks, e.g. comma ‘,’ and semicolon ‘;’, as words may help in automatic tagging, since punctuation may give good clues about the surrounding parts of speech.
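As a minimal sketch of treating punctuation as words, the function below splits a sentence so that each punctuation mark becomes its own token. The function name and the regular expression are assumptions made for illustration; a real tokenizer for a given language would need to handle its script, clitics, abbreviations and so on.

```python
import re


def tokenize(sentence):
    """Split a sentence into word tokens and single punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark.
    return re.findall(r"\w+|[^\w\s]", sentence)


tokens = tokenize("John lost his book, but he still has his pen.")
# The comma and the full stop are now separate tokens that can receive tags.
```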

Single word tags
This is not a category indicating a particular group; rather, it has been defined to indicate that sometimes a single word (instead of a group of words) has unique syntactic behavior and may be given a unique tag, normally the same as the word itself. For example, in the sentence “he tried to learn geography”, the word ‘to’ is not acting as a preposition but indicates that the next verb is in the infinitive form. This use of ‘to’ is thus different from its use in the sentence “he is going to the school.” Thus, TO can be a part of speech by itself in the context of the infinitive.
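The distinction between the two uses of ‘to’ can be sketched with a toy rule: if ‘to’ is followed by a base-form verb, tag it TO, otherwise tag it as a preposition. The verb list and the function name are hypothetical, and IN is used here simply as a preposition label; a real tagger would rely on much richer context.

```python
# Hypothetical mini-lexicon of base-form verbs, for illustration only.
BASE_VERBS = {"learn", "go", "do", "play"}


def tag_to(tokens):
    """Return (position, tag) for each 'to': TO before a base verb, else IN."""
    tagged = []
    for i, tok in enumerate(tokens):
        if tok.lower() != "to":
            continue
        # Look at the following token, if there is one.
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        tagged.append((i, "TO" if next_tok in BASE_VERBS else "IN"))
    return tagged


infinitive_use = tag_to("he tried to learn geography".split())
preposition_use = tag_to("he is going to the school".split())
```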

Symbols
For mathematical or technical symbols, etc.

2.2. Part of Speech Category Definition
The first step of part of speech analysis is to go through these categories and decide which are relevant to a particular language. The following cases may occur:
• A category may not be applicable to a language. Such a category should be discarded.
• Two categories may be merged if they are very similar in a language.
• New categories may be added based on the analysis of a language.

The categories should differ enough to justify keeping them separate. If two categories share many properties and we are not sure whether they will need to be handled differently in future, it is better to keep them as one category and divide it at the next level.

List all the categories you decide on in the attached SummaryOfPartOfSpeech.doc.
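The three cases (discard, merge, add) can be sketched as operations on the generic category list. The function name, its parameters, and the example decisions for an imaginary language are all assumptions made for illustration.

```python
GENERIC_CATEGORIES = [
    "Noun", "Verb", "Adjective", "Adverb", "Preposition/Postposition",
    "Coordinate Conjunction", "Subordinate Conjunction", "Pronoun",
    "Determiner", "Interjection", "Punctuation", "Single word tags", "Symbols",
]


def revise_categories(base, discard=(), merge=None, add=()):
    """Apply discard / merge / add decisions to a list of categories.

    merge maps a new category name to the existing categories it replaces.
    """
    merge = merge or {}
    # Reverse lookup: old category -> name of the merged category.
    merged_into = {old: new for new, olds in merge.items() for old in olds}
    result = []
    for cat in base:
        if cat in discard:
            continue
        # A merged category appears once, at the position of its first member.
        name = merged_into.get(cat, cat)
        if name not in result:
            result.append(name)
    result.extend(c for c in add if c not in result)
    return result


# Made-up decisions for an imaginary language, purely as an example.
revised = revise_categories(
    GENERIC_CATEGORIES,
    discard={"Determiner"},
    merge={"Conjunction": ["Coordinate Conjunction", "Subordinate Conjunction"]},
    add=["Classifier"],
)
```

The resulting list is what would be recorded in SummaryOfPartOfSpeech.doc.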

2.3. Basic Part of Speech Tags
Once the POS categories are decided, they are mapped to the actual tags that will be used for corpus annotation. Some or all categories will be further divided to capture distinctions in the syntactic behavior of words within the same category.

Starting with the Noun category, we may want to divide nouns into multiple tags if we notice structural differences among different groups of nouns. For example, proper nouns behave differently from common nouns, so we can divide nouns into “common noun” and “proper noun” tags. Another possible division is between countable nouns and mass nouns. In English, some determiners occur with countable nouns but not with mass nouns, and vice versa. For example, we can say ‘much water’ but not ‘much boy’. The question to answer is which differences are important for our use and which are not. It is not practical to handle every difference, because we would end up with a very large tagset that is hard to manage during corpus annotation. We therefore need to compromise and decide which differences are crucial to our application.

Review each category and decide whether it can be sub-categorized. Assign a POS tag to each sub-category (or to the category itself if no sub-categorization is possible). When selecting the tag for any part of speech, check the list of Penn TreeBank tags given in PennGuidelines.pdf, Section 3. If a tag is already listed for that particular POS, use that tag; otherwise devise a new one in a similar format. Present the information about each tag in PartOfSpeechTemplate.doc.
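The rule "reuse a Penn TreeBank tag if one exists, otherwise devise a new one" can be sketched as a simple lookup. The dictionary below is only a small excerpt with simplified descriptions (the full list is in PennGuidelines.pdf), and the function name and the new tag PSP are hypothetical.

```python
# Small excerpt of Penn TreeBank tags, keyed by a simplified description.
PENN_TAGS = {
    "common noun": "NN",
    "proper noun": "NNP",
    "adjective": "JJ",
    "adverb": "RB",
    "coordinate conjunction": "CC",
    "determiner": "DT",
    "interjection": "UH",
}


def choose_tag(pos_name, new_tag):
    """Reuse the Penn tag for pos_name if listed; otherwise use new_tag."""
    return PENN_TAGS.get(pos_name.lower(), new_tag)


reused = choose_tag("Proper noun", "PN")       # a Penn tag exists, so it wins
devised = choose_tag("Postposition", "PSP")    # not in the excerpt, new tag used
```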

2.4. Detailed Part of Speech Tags
The third-level division is optional and handles morphological differences, such as variant forms of nouns due to gender or number, or variant forms of verbs due to tense, gender or number agreement. For example, a common noun can be singular, e.g. ‘car’, or plural, e.g. ‘cars’. On the basis of number, common nouns can be further divided into “common noun singular” and “common noun plural”. In the same way, ‘-s’ is added to a verb in the present tense when its subject is third person singular, and ‘-ed’ is added to a verb when it is past tense or a past participle. All such agreement variations are handled at this level. We will normally not use tags of this level for annotation, but such tags will be listed with their respective parent tag in its template, under the heading “child tags”, in the file PartOfSpeechTemplate.doc.
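Since third-level tags are normally not used for annotation, a detailed tagging can be collapsed to its parent tags. The sketch below assumes a hypothetical child-to-parent table: VBZ, VBD and VBN follow Penn TreeBank naming, NN-SG and NN-PL are invented for illustration, and grouping them under a parent in this way is an assumption.

```python
# Hypothetical "child tag" -> parent tag table, as it might be derived
# from the "child tags" heading of each tag's template.
CHILD_TO_PARENT = {
    "NN-SG": "NN",   # common noun singular, e.g. 'car'
    "NN-PL": "NN",   # common noun plural, e.g. 'cars'
    "VBZ": "VB",     # third person singular present, '-s'
    "VBD": "VB",     # past tense, '-ed'
    "VBN": "VB",     # past participle, '-ed'
}


def to_annotation_tags(tagged):
    """Replace each third-level tag by its parent; leave other tags as they are."""
    return [(word, CHILD_TO_PARENT.get(tag, tag)) for word, tag in tagged]


detailed = [("cars", "NN-PL"), ("parked", "VBN"), ("slowly", "RB")]
collapsed = to_annotation_tags(detailed)
```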
