fr.gouv.culture.sdx.search.lucene.analysis
Class Glosser_ar_en

java.lang.Object
  extended by org.apache.lucene.analysis.Analyzer
      extended by fr.gouv.culture.sdx.search.lucene.analysis.AbstractAnalyzer
          extended by fr.gouv.culture.sdx.search.lucene.analysis.Glosser_ar_en
All Implemented Interfaces:
Analyzer, org.apache.avalon.framework.configuration.Configurable, org.apache.avalon.framework.logger.LogEnabled, java.io.Serializable, org.apache.excalibur.xml.sax.XMLizable

public final class Glosser_ar_en
extends AbstractAnalyzer

An English glosser for the Arabic language. This glosser uses Tim Buckwalter's algorithm (available from the LDC Catalog) to identify the morphological category of Arabic tokens and then returns their glosses. The list of meaningful morphological categories is not yet final, but the current list gives good results.

Author:
Pierrick Brihaye, 2003
See Also:
Serialized Form

Field Summary
protected static java.lang.String ANALYZER_TYPE
           
static java.lang.String[] STOP_WORDS
          An array containing some common English words that are usually not useful for searching.
 
Fields inherited from class fr.gouv.culture.sdx.search.lucene.analysis.AbstractAnalyzer
logger
 
Constructor Summary
Glosser_ar_en()
           
 
Method Summary
 void configure(org.apache.avalon.framework.configuration.Configuration configuration)
          Configure the glosser.
 void enableLogging(org.apache.avalon.framework.logger.Logger logger)
          Passes a logger to the class.
protected  java.lang.String getAnalyzerType()
           
 org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName, java.io.Reader reader)
          Returns a token stream of glosses of Arabic words whose morphological categories are found to be semantically meaningful.
 
Methods inherited from class fr.gouv.culture.sdx.search.lucene.analysis.AbstractAnalyzer
toSAX
 
Methods inherited from class org.apache.lucene.analysis.Analyzer
tokenStream
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 
Methods inherited from interface fr.gouv.culture.sdx.search.lucene.analysis.Analyzer
tokenStream
 

Field Detail

ANALYZER_TYPE

protected static final java.lang.String ANALYZER_TYPE
See Also:
Constant Field Values

STOP_WORDS

public static final java.lang.String[] STOP_WORDS
An array containing some common English words that are usually not useful for searching.
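
As a rough illustration of how a stop-word array of this kind is typically applied, the sketch below drops English glosses that appear in a stop list. It does not use the SDX or Lucene classes: the class name StopWordSketch, the method removeStopWords, and the contents of the array are all hypothetical, since the actual values of STOP_WORDS are not listed on this page.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class StopWordSketch {
    // Hypothetical stand-in for Glosser_ar_en.STOP_WORDS; the real contents may differ.
    public static final String[] STOP_WORDS = {"a", "an", "and", "of", "the", "to"};

    // Returns the glosses that are not stop words, preserving their order.
    public static List<String> removeStopWords(List<String> glosses) {
        Set<String> stop = new HashSet<>(Arrays.asList(STOP_WORDS));
        List<String> kept = new ArrayList<>();
        for (String gloss : glosses) {
            // Compare case-insensitively so "The" and "the" are treated alike.
            if (!stop.contains(gloss.toLowerCase(Locale.ROOT))) {
                kept.add(gloss);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> glosses = Arrays.asList("the", "book", "of", "the", "student");
        System.out.println(removeStopWords(glosses)); // prints [book, student]
    }
}
```

Whether Glosser_ar_en compares stop words case-insensitively is not stated here; the lower-casing step is an assumption of the sketch.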

Constructor Detail

Glosser_ar_en

public Glosser_ar_en()
Method Detail

getAnalyzerType

protected java.lang.String getAnalyzerType()
Specified by:
getAnalyzerType in class AbstractAnalyzer

configure

public void configure(org.apache.avalon.framework.configuration.Configuration configuration)
               throws org.apache.avalon.framework.configuration.ConfigurationException
Configure the glosser.

Specified by:
configure in interface org.apache.avalon.framework.configuration.Configurable
Overrides:
configure in class AbstractAnalyzer
Parameters:
configuration - The configuration object
Throws:
org.apache.avalon.framework.configuration.ConfigurationException - If a problem occurs during configuration

enableLogging

public void enableLogging(org.apache.avalon.framework.logger.Logger logger)
Passes a logger to the class.

Specified by:
enableLogging in interface org.apache.avalon.framework.logger.LogEnabled
Overrides:
enableLogging in class AbstractAnalyzer
Parameters:
logger - The logger

tokenStream

public org.apache.lucene.analysis.TokenStream tokenStream(java.lang.String fieldName,
                                                          java.io.Reader reader)
Returns a token stream of glosses of Arabic words whose morphological categories are found to be semantically meaningful.

Parameters:
fieldName - The name of the field being analyzed
reader - The reader
Returns:
The token stream
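
The idea behind this method, keeping only the glosses of tokens whose morphological category is considered meaningful, can be sketched without the SDX or Lucene classes. The example below is a simplified illustration and not the actual implementation: the class GlossSketch, the tiny in-memory lexicon, the romanized tokens, and the category names NOUN, VERB and PREP are all assumptions made for the sketch. The real glosser relies on Tim Buckwalter's algorithm for the morphological analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class GlossSketch {
    // One morphological analysis of a token: its category and its English gloss.
    static final class Analysis {
        final String category;
        final String gloss;

        Analysis(String category, String gloss) {
            this.category = category;
            this.gloss = gloss;
        }
    }

    // Hypothetical list of categories treated as semantically meaningful.
    static final Set<String> MEANINGFUL = new HashSet<>(Arrays.asList("NOUN", "VERB"));

    // Hypothetical lexicon keyed by romanized tokens; a real analyzer would
    // compute the analysis instead of looking it up in a fixed table.
    static final Map<String, Analysis> LEXICON = new HashMap<>();
    static {
        LEXICON.put("ktAb", new Analysis("NOUN", "book"));
        LEXICON.put("fy", new Analysis("PREP", "in"));
        LEXICON.put("mdrsp", new Analysis("NOUN", "school"));
    }

    // Returns the glosses of the tokens whose category is meaningful.
    // Unknown tokens and tokens in other categories are skipped.
    public static List<String> gloss(List<String> tokens) {
        List<String> glosses = new ArrayList<>();
        for (String token : tokens) {
            Analysis analysis = LEXICON.get(token);
            if (analysis != null && MEANINGFUL.contains(analysis.category)) {
                glosses.add(analysis.gloss);
            }
        }
        return glosses;
    }

    public static void main(String[] args) {
        System.out.println(gloss(Arrays.asList("ktAb", "fy", "mdrsp"))); // prints [book, school]
    }
}
```

In this sketch the preposition is dropped because its category is not in the meaningful set, so only the two noun glosses reach the output.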


Copyright © 2000-2003 Ministere de la culture et de la communication / AJLSM. All Rights Reserved.