Oracle8 ConText Cartridge Administrator's Guide Release 2.4 A63820-01
This chapter provides reference information for using the text loading utility, ctxload, provided with ConText.
The topics discussed in this chapter are:
The ctxload utility can be used to perform the following operations:
Use ctxload to load text from a load file into a LONG or LONG RAW column in a table.
A load file is an ASCII flat file that contains the plain text of the documents to be stored in a text table, along with any structured data (title, author, date, etc.) for each document. In place of the text for a document, the load file can instead store a pointer to a separate file that holds the actual text (formatted or plain) of the document.
The ctxload utility creates one row in the table for each document identified by a header in the load file.
See Also:
For examples of load files for text loading, see "Structure of Text Load File" in this chapter.
The ctxload utility supports updating database columns from operating system files and exporting database columns to files, specifically LONG RAW and LONG columns used as text columns for ConText.
Use ctxload to load a thesaurus from an import file into the ConText thesaurus tables.
An import file is an ASCII flat file that contains entries for synonyms, broader terms, narrower terms, or related terms which can be used to expand queries.
ctxload can also be used to export a thesaurus by dumping the contents of the thesaurus into a user-specified operating-system file.
See Also:
For examples of import files for thesaurus importing, see "Structure of Thesaurus Import File" in this chapter.
The syntax for running ctxload is:
ctxload -user username[/password][@sqlnet_address] -name object_name -file file_name -pk primary_key [-export] [-update] [-thes] [-thescase y|n] [-thesdump] [-separate] [-longsize n] [-date date_mask] [-log file_name] [-trace] [-commitafter n] [-verbose]
-user
Specifies the username and password of the user running ctxload.
The username and password can be followed immediately by @sqlnet_address to permit logon to remote databases. The value for sqlnet_address is a database connect string. If the TWO_TASK environment variable is set to a remote database, you do not have to specify a value for sqlnet_address to connect to the database.
-name
If ctxload is used to load text, -name specifies the name of the table to be loaded. The table must be accessible to the user specified on the command line.
If ctxload is used to update/export a text column, -name specifies the policy for the column to be exported/updated.
If ctxload is used to import a thesaurus, -name specifies the name of the thesaurus to be imported. The thesaurus name is used to specify the thesaurus to be used for expanding query terms in queries.
If ctxload is used to export a thesaurus, -name specifies the name of the thesaurus to be exported.
-file
If ctxload is used to load text, -file specifies the name of the load file that contains the document header markers, structured data, and text/file pointers (see the -separate argument).
If ctxload is used to update a row in a text column, -file specifies the file which stores the text to be inserted into the text column for the row specified by -pk.
If ctxload is used to export a row in a text column, -file specifies the file which stores the text to be exported from the text column for the row specified by -pk.
If ctxload is used to load a thesaurus, -file specifies the name of the import file which contains the thesaurus entries.
If ctxload is used to export a thesaurus, -file specifies the name of the export file created by ctxload.
-pk
Specifies the primary key for the row in which the text column (LONG or LONG RAW) to be exported/updated is located.
For tables that contain composite primary keys, enter the multiple primary key values as a string, with each primary key value separated by a comma.
-export
Specifies that ctxload exports the contents of a cell in a database table into the operating system file specified by -file. The cell is identified as the LONG RAW or LONG column for the row specified by -pk in the table for the policy specified by -name.
-update
Specifies that ctxload updates the contents of a cell in a database table with the contents of the operating system file specified by -file. The cell is identified as the LONG RAW or LONG column for the row specified by -pk in the table for the policy specified by -name.
-thes
Specifies that ctxload imports a thesaurus. The file from which it loads the thesaurus is specified by the -file argument. The name of the thesaurus to be imported is specified by the -name argument.
-thescase
Specifies whether ctxload creates a case-sensitive thesaurus with the name specified by -name and populates the thesaurus with entries from the thesaurus import file specified by -file. If -thescase is 'y' (the thesaurus is case-sensitive), ConText enters the terms in the thesaurus exactly as they appear in the import file.
The default for -thescase is 'n' (case-insensitive thesaurus).
-thesdump
Specifies that ctxload exports a thesaurus. The name of the thesaurus to be exported is specified by the -name argument. The file into which the thesaurus is dumped is specified by the -file argument.
-separate
For text loading, specifies that the text of each document in the load file is actually a pointer to a separate text file. During processing, ctxload loads the contents of each text file into the LONG or LONG RAW column for the specified row.
-longsize
For text loading, specifies the maximum number of kilobytes to be loaded into the LONG or LONG RAW column. This argument may be needed when loading text from separate files, and it can help reduce memory usage when loading smaller embedded text.
The minimum value is 1 (Kb, i.e. 1024 bytes) and the maximum value is machine-dependent. The default is 64 (Kb).
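As a rough illustration of the kilobyte arithmetic, the following Python sketch derives a -longsize value from the size in bytes of the largest document. The function name longsize_kb is hypothetical, and the sketch assumes you want the limit to be at least as large as your largest document; this guide does not say how ctxload treats text that exceeds the limit.

```python
import math


def longsize_kb(largest_doc_bytes: int) -> int:
    """Return the smallest whole number of kilobytes (1 Kb = 1024 bytes)
    that can hold a document of the given size, never less than 1,
    which is the documented minimum for -longsize."""
    if largest_doc_bytes < 0:
        raise ValueError("document size must be non-negative")
    return max(1, math.ceil(largest_doc_bytes / 1024))


# 150000 bytes / 1024 = 146.48..., which rounds up to 147 Kb.
print(longsize_kb(150_000))
```

The upper bound is machine-dependent, so a computed value may still need to be checked against your platform.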
-date
Specifies the TO_CHAR date format for any date columns loaded using ctxload.
-log
Specifies the name of the log file to which ctxload writes any National Language Support (NLS) messages generated during processing. If you do not specify a log file name, the messages appear on the standard output.
-trace
Specifies that a server process trace file is enabled using 'ALTER SESSION SET SQL_TRACE TRUE'. This command captures all processed SQL statements in a trace file, which can be used for debugging purposes. The location of the trace file is operating-system dependent and may be modified using the USER_DUMP_DEST initialization parameter.
-commitafter
Specifies the number of rows (documents) that are inserted into the table before a commit is issued to the database. The default is 1.
-verbose
Specifies that non-NLS messages can appear on standard output.
The following conditions apply to the command-line syntax:
This section provides examples for each of the operations that ctxload can perform:
The following example loads documents from the reviews.txt load file into table docs for user jsmith. It also writes log information to a file called log2.out. Because -commitafter was not specified, each row (document) is committed to the database after it is inserted into the docs table.
Also, because -separate was not specified, ctxload expects the text for each document to be embedded in the reviews.txt file.
ctxload -user jsmith/123abc -name docs -file reviews.txt -log log2.out
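When text loading is driven from a script, it can help to assemble the argument list programmatically. The following Python sketch builds the same kind of command as the example above. The function text_load_args and its parameter names are hypothetical; it uses only flags listed in the syntax section, and it only prints the command rather than running ctxload.

```python
import shlex


def text_load_args(user, table, load_file, log_file=None,
                   separate=False, commit_after=None):
    """Build a ctxload argument list for text loading (illustrative only)."""
    args = ["ctxload", "-user", user, "-name", table, "-file", load_file]
    if separate:
        # The load file holds file pointers instead of embedded text.
        args.append("-separate")
    if commit_after is not None:
        # Number of rows inserted before each commit (documented default: 1).
        args += ["-commitafter", str(commit_after)]
    if log_file is not None:
        args += ["-log", log_file]
    return args


args = text_load_args("jsmith/123abc", "docs", "reviews.txt", log_file="log2.out")
print(shlex.join(args))
```

Keeping the arguments as a list avoids quoting mistakes if the command is later passed to a process launcher instead of being printed.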
The following UNIX-based example illustrates updating the LONG RAW column for the row identified by primary key 3452 in the table for a policy named word_docs. The column is updated with the contents of resume1.doc located in /docs:
ctxload -user ctxdemo/passwd -update -name word_docs -pk 3452 -file /docs/resume1.doc
The following UNIX-based example illustrates exporting the LONG RAW column for the row identified by primary key 3452 in the table for a policy named word_docs. The contents of the cell in the column are copied to a file named new.doc located in /docs:
ctxload -user ctxdemo/passwd -export -name word_docs -pk 3452 -file /docs/new.doc
The following example is identical to the preceding example, except the row is identified by a compound primary key consisting of a name and location. The name and location values are separated by a comma, and the entire primary key string is enclosed in double quotation marks because the location value includes a space:
ctxload -user ctxdemo/passwd -export -name word_docs -pk "Smith,HQ 1" -file /docs/new.doc
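The comma-separated key string and its shell quoting can be produced mechanically. The Python sketch below is one possible way to do it; composite_pk is a hypothetical helper, and rejecting values that contain a comma is an assumption made here because this guide does not describe any way to escape a comma inside a key value.

```python
import shlex


def composite_pk(*values):
    """Join primary key values into the comma-separated string used by -pk."""
    parts = [str(v) for v in values]
    for part in parts:
        if "," in part:
            # Assumption: no escape mechanism is documented, so refuse the value.
            raise ValueError(f"key value contains a comma: {part!r}")
    return ",".join(parts)


pk = composite_pk("Smith", "HQ 1")
cmd = ["ctxload", "-user", "ctxdemo/passwd", "-export",
       "-name", "word_docs", "-pk", pk, "-file", "/docs/new.doc"]
# shlex.join quotes any argument containing a space for a POSIX shell.
print(shlex.join(cmd))
```

shlex.join emits single quotes around the key; in a POSIX shell that has the same effect here as the double quotation marks used in the example.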
The following example imports a thesaurus named tech_doc from an import file named tech_thesaurus.txt:
ctxload -user jsmith/123abc -thes -name tech_doc -file tech_thesaurus.txt
The following example dumps the contents of a thesaurus named tech_doc into a file named tech_thesaurus.out:
ctxload -user jsmith/123abc -thesdump -name tech_doc -file tech_thesaurus.out
The load file must use the following format for each document, as well as any structured data associated with the document:
<TEXTSTART: col_name1=doc_data, col_name2=doc_data,...col_nameN=doc_data>
text. . .
<TEXTEND>
where:
<TEXTSTART: is a header marker that indicates the beginning of a document. It may also contain one or more of the following fields used to specify structured data for a document:
col_name is the name of a column that will store structured data for the document.
doc_data is the structured data that will be stored in the column specified by col_name.
text is the text of the document to be loaded or the name (and location, if necessary) of an operating system file containing the text to be loaded.
<TEXTEND> indicates the end of the document.
The following conditions apply to the structure of the load file:
The following conditions apply to the syntax utilized in the text load file:
The following example illustrates a correctly formatted text load file containing structured employee information, such as employee number (1000, 1024) and name (Joe Smith, Mary Jones), and the text for each document:
<TEXTSTART: EMPNO=1000, ELNAME='Smith', EFNAME='Joe'>
Joe has an interesting resume, includes...cliff-diving.
<TEXTEND>
<TEXTSTART: EMPNO=1024, EFNAME='Mary', ELNAME='Jones'>
Mary has many excellent skills, including...technical, marketing, and organizational. Team player.
<TEXTEND>
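A load file like this one can be generated from existing records. The Python sketch below writes one document entry; format_document is a hypothetical helper. It assumes that string values are wrapped in single quotes and numbers are written bare, as in the example above, and that each marker sits on its own line. It does not escape quotes, commas, or angle brackets inside values, because the rules for those cases are not shown here.

```python
def format_document(columns: dict, text: str) -> str:
    """Render one load-file entry: header with structured data, text, end marker."""
    fields = []
    for name, value in columns.items():
        if isinstance(value, str):
            fields.append(f"{name}='{value}'")   # quoted, as in the example
        else:
            fields.append(f"{name}={value}")     # numbers written bare
    header = "<TEXTSTART: " + ", ".join(fields) + ">"
    return "\n".join([header, text, "<TEXTEND>"]) + "\n"


entry = format_document(
    {"EMPNO": 1000, "ELNAME": "Smith", "EFNAME": "Joe"},
    "Joe has an interesting resume, includes...cliff-diving.",
)
print(entry, end="")
```

For a load that uses -separate, the same helper could be called with a file name in place of the document text.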
The following example illustrates a correctly formatted text load file containing structured employee information, such as employee number (1000, 1024) and name (Joe Smith, Mary Jones), and a file name pointer for each document.
<TEXTSTART: EMPNO=1024, EFNAME='Mary', ELNAME='Jones'>
mjones.doc
<TEXTEND>
<TEXTSTART: EMPNO=1000, EFNAME='Joe', ELNAME='Smith'>
jsmith.doc
<TEXTEND>
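Before running a load, it can be useful to check a load file for simple slips such as a column name repeated within one header. The following Python sketch is a simplified reader, not ctxload's own parser: parse_load_file is a hypothetical name, and the sketch assumes that values contain no commas or closing angle brackets.

```python
import re

# One document: header fields, then the body, up to the end marker.
DOC_RE = re.compile(r"<TEXTSTART:(?P<header>[^>]*)>(?P<body>.*?)<TEXTEND>", re.DOTALL)


def parse_load_file(content: str):
    """Return a list of (columns, body) pairs; raise on a repeated column name."""
    docs = []
    for match in DOC_RE.finditer(content):
        columns = {}
        for field in match.group("header").split(","):
            if not field.strip():
                continue
            name, _, value = field.partition("=")
            name = name.strip()
            if name in columns:
                raise ValueError(f"column {name} appears twice in one header")
            columns[name] = value.strip().strip("'")
        docs.append((columns, match.group("body").strip()))
    return docs


sample = """<TEXTSTART: EMPNO=1024, EFNAME='Mary', ELNAME='Jones'>
mjones.doc
<TEXTEND>
<TEXTSTART: EMPNO=1000, EFNAME='Joe', ELNAME='Smith'>
jsmith.doc
<TEXTEND>
"""
for columns, body in parse_load_file(sample):
    print(columns["EMPNO"], body)
```

With file pointers, the body of each entry is the file name, so the same check could be extended to confirm that each referenced file exists.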
The import file must use the following format for entries in the thesaurus:
phrase
  BT broader_term
  NT narrower_term1
  NT narrower_term2
  . . .
  NT narrower_termN
  BTG broader_term
  NTG narrower_term1
  NTG narrower_term2
  . . .
  NTG narrower_termN
  BTP broader_term
  NTP narrower_term1
  NTP narrower_term2
  . . .
  NTP narrower_termN
  BTI broader_term
  NTI narrower_term1
  NTI narrower_term2
  . . .
  NTI narrower_termN
  SYN synonym1
  SYN synonym2
  . . .
  SYN synonymN
  USE|SEE synonym1
  RT related_term1
  RT related_term2
  . . .
  RT related_termN
  SN text
where:
phrase is a word or phrase that is defined as having synonyms, broader terms, narrower terms, and/or related terms.
In compliance with ISO-2788 standards, a TT marker can be placed before a phrase to indicate that the phrase is the top term in a hierarchy; however, the TT marker is not required. In fact, ctxload ignores TT markers during import.
In ConText, a top term is identified as any phrase that does not have a broader term (BT, BTG, BTP, or BTI).
BT, BTG, BTP, and BTI are the markers that indicate broader_termN is a broader (generic|partitive|instance) term for phrase.
NT, NTG, NTP, and NTI are the markers that indicate narrower_termN is a narrower (generic|partitive|instance) term for phrase.
If phrase does not have a broader (generic|partitive|instance) term, but has one or more narrower (generic|partitive|instance) terms, phrase is created as a top term in the respective hierarchy (in a ConText thesaurus, the BT/NT, BTG/NTG, BTP/NTP, and BTI/NTI hierarchies are separate structures).
SYN is a marker that indicates phrase and synonymN are synonyms within a synonym ring.
USE and SEE are markers that indicate phrase and synonymN are synonyms within a synonym ring (similar to SYN); however, USE | SEE also indicates synonymN is the preferred term for the synonym ring. Either marker can be used to define the preferred term for a synonym ring.
RT is the marker that indicates related_termN is a related term for phrase.
SN is the marker that indicates the following text is a scope note (i.e. comment) for the preceding entry.
broader_termN is a word or phrase that conceptually provides a more general description or category for phrase. For example, the word elephant could have a broader term of land mammal.
narrower_termN is a word or phrase that conceptually provides a more specific description for phrase. For example, the word elephant could have narrower terms of Indian elephant and African elephant.
synonymN is a word or phrase that has the same meaning as phrase. For example, the word elephant could have a synonym of pachyderm.
related_termN is a word or phrase that has a meaning related to, but not necessarily synonymous with, phrase. For example, the word elephant could have a related term of woolly mammoth.
In compliance with thesauri standards, the import file supports formatting hierarchies (BT/NT, BTG/NTG, BTP/NTP, BTI/NTI) by indenting the terms under the top term and using NT (or NTG, NTP, NTI) markers that indicate the level for the term:
phrase
  NT1 narrower_term1
    NT2 narrower_term1.1
    NT2 narrower_term1.2
      NT3 narrower_term1.2.1
      NT3 narrower_term1.2.2
  NT1 narrower_term2
  . . .
  NT1 narrower_termN
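A branch in this leveled form can be produced from a nested structure. The Python sketch below is one way to do it; render_hierarchy is a hypothetical helper, and the two-space indent per level is an arbitrary choice made for readability, since the level number in each marker already records the depth.

```python
def render_hierarchy(top_term, children, marker="NT"):
    """Render a nested dict of narrower terms as NT1/NT2/... lines.

    `children` maps each narrower term to a dict of its own narrower terms.
    `marker` may be NT, NTG, NTP, or NTI.
    """
    lines = [top_term]

    def walk(subtree, level):
        for term, sub in subtree.items():
            lines.append("  " * level + f"{marker}{level} {term}")
            walk(sub, level + 1)

    walk(children, 1)
    return "\n".join(lines) + "\n"


tree = {"mammal": {"cat": {"domestic cat": {}, "wild cat": {}}, "dog": {}}}
print(render_hierarchy("animal", tree), end="")
```

Python dictionaries preserve insertion order, so the terms are written in the order in which they were added to the structure.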
Using this method, the entire branch for a top term can be represented hierarchically in the import file.
The following conditions apply to the structure of the entries in the import file:
For example: cranes (birds), cranes (lifting equipment)
Example of incorrect SN usage:
VIEW CAMERAS
  SN Cameras with through-the lens focusing and a
     range of movements of the lens plane relative
     to the film plane
Example of correct SN usage:
VIEW CAMERAS
  SN Cameras with through-the lens focusing and a
  SN range of movements of the lens plane relative
  SN to the film plane
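The correct example repeats the SN marker on every line of a multi-line scope note. The Python sketch below shows one way to split a long note into such lines. The helper name scope_note_lines, the default width of 50 characters, and the two-space indent are all assumptions chosen for illustration.

```python
import textwrap


def scope_note_lines(note, width=50, indent="  "):
    """Wrap a scope note so that every output line carries its own SN marker."""
    chunks = textwrap.wrap(
        note,
        width=width,
        break_long_words=False,   # never split a word in the middle
        break_on_hyphens=False,   # keep hyphenated words such as "through-the" whole
    )
    return [f"{indent}SN {chunk}" for chunk in chunks]


note = ("Cameras with through-the lens focusing and a range of movements "
        "of the lens plane relative to the film plane")
print("VIEW CAMERAS")
for line in scope_note_lines(note):
    print(line)
```

The exact line breaks depend on the chosen width, so the output may not match the example line for line, but each line still begins with SN.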
The following conditions apply to the relationships defined for the entries in the import file:
Example of incorrect RT usage:
MOVING PICTURE CAMERAS
  RT CINE CAMERAS
     TELEVISION CAMERAS
Example of correct RT usage:
MOVING PICTURE CAMERAS
  RT CINE CAMERAS
  RT TELEVISION CAMERAS
This section provides three examples of correctly formatted thesaurus import files.
Example 1:
cat
  SYN feline
  NT domestic cat
  NT wild cat
  BT mammal
mammal
  BT animal
domestic cat
  NT Persian cat
  NT Siamese cat
wild cat
  NT tiger
tiger
  NT Bengal tiger
dog
  BT mammal
  NT domestic dog
  NT wild dog
  SYN canine
domestic dog
  NT German Shepard
wild dog
  NT Dingo
Example 2:
animal
  NT1 mammal
    NT2 cat
      NT3 domestic cat
        NT4 Persian cat
        NT4 Siamese cat
      NT3 wild cat
        NT4 tiger
          NT5 Bengal tiger
    NT2 dog
      NT3 domestic dog
        NT4 German Shepard
      NT3 wild dog
        NT4 Dingo
cat
  SYN feline
dog
  SYN canine
Example 3:
35MM CAMERAS
  BT MINIATURE CAMERAS
CAMERAS
  BT OPTICAL EQUIPMENT
  NT MOVING PICTURE CAMERAS
  NT STEREO CAMERAS
LAND CAMERAS
  USE VIEW CAMERAS
VIEW CAMERAS
  SN Cameras with through-the lens focusing and a range of
  SN movements of the lens plane relative to the film plane
  UF LAND CAMERAS
  BT STILL CAMERAS
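The incorrect SN and RT examples earlier in this section both fail because a continuation line lacks a marker. A small pre-check can flag such lines before an import is attempted. The Python sketch below is only a rough lint, not a reimplementation of ctxload's validation: find_unmarked_lines is a hypothetical name, and the sketch assumes that each phrase starts in the first column and each relationship line is indented. It accepts the markers from the format listing, leveled forms such as NT1, and UF, which appears in the third example.

```python
import re

# Marker followed by whitespace and at least one character of term text.
MARKER_RE = re.compile(r"^(BT[GPI]?|NT[GPI]?\d*|SYN|USE|SEE|UF|RT|SN)\s+\S")


def find_unmarked_lines(text):
    """Return (line_number, line) pairs for indented lines without a marker."""
    problems = []
    for number, raw in enumerate(text.splitlines(), start=1):
        if not raw.strip() or not raw[0].isspace():
            continue  # blank line, or a phrase line starting in column one
        if not MARKER_RE.match(raw.strip()):
            problems.append((number, raw.strip()))
    return problems


bad = "MOVING PICTURE CAMERAS\n  RT CINE CAMERAS\n     TELEVISION CAMERAS\n"
good = "MOVING PICTURE CAMERAS\n  RT CINE CAMERAS\n  RT TELEVISION CAMERAS\n"
print(find_unmarked_lines(bad))
print(find_unmarked_lines(good))
```

A check like this cannot detect a continuation line whose first word happens to be a marker, so it complements rather than replaces a careful review of the import file.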