10
Text Loading Utility

This chapter provides reference information for using the text loading utility, ctxload, provided with ConText.

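The topics discussed in this chapter are:

Overview of ctxload
Command-line Syntax
Command-line Examples
Structure of Text Load File
Structure of Thesaurus Import File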

Overview of ctxload

The ctxload utility can be used to perform the following operations:

Text Loading

Use ctxload to load text from a load file into a LONG or LONG RAW column in a table.

Suggestion:
If the target table does not contain a LONG or LONG RAW column or you do not want to load text into a LONG or LONG RAW column, you may want to use SQL*Loader to populate the table with text.

A load file is an ASCII flat file that contains the plain text, as well as any structured data (title, author, date, etc.), for documents to be stored in a text table. In place of the text for each document, the load file can instead store a pointer to a separate file that holds the actual text (formatted or plain) of the document.

Note:
The ctxload utility does not support load files that contain both embedded text and file pointers. You must use one method or the other when creating a load file.

The ctxload utility creates one row in the table for each document identified by a header in the load file.

See Also:
For examples of load files for text loading, see "Structure of Text Load File" in this chapter.

Document Updating/Exporting

The ctxload utility supports updating database columns from operating system files and exporting database columns to files, specifically LONG RAW and LONG columns used as text columns for ConText.

Note:
The updating/exporting of data is performed in sections to avoid the necessity of a large amount of memory (up to 2 Gigabytes) for the update/fetch buffer.
As a result, a minimum of 16 Kilobytes of memory is required for document update/export.

Thesaurus Importing and Exporting

Use ctxload to load a thesaurus from an import file into the ConText thesaurus tables.

An import file is an ASCII flat file that contains entries for synonyms, broader terms, narrower terms, or related terms which can be used to expand queries.

ctxload can also be used to export a thesaurus by dumping the contents of the thesaurus into a user-specified operating-system file.

See Also:
For examples of import files for thesaurus importing, see "Structure of Thesaurus Import File" in this chapter.

Command-line Syntax

The syntax for running ctxload is:

ctxload -user username[/password][@sqlnet_address]
        -name object_name
        -file file_name
        -pk primary_key
       [-export]
       [-update]
       [-thes]
       [-thescase y|n]
       [-thesdump]
       [-separate]
       [-longsize n]
       [-date date_mask]
       [-log file_name]
       [-trace]
       [-commitafter n]
       [-verbose]

Mandatory Arguments

-user

Specifies the username and password of the user running ctxload.

The username and password can be followed immediately by @sqlnet_address to permit logon to remote databases. The value for sqlnet_address is a database connect string. If the TWO_TASK environment variable is set to a remote database, you do not have to specify a value for sqlnet_address to connect to the database.

-name

If ctxload is used to load text, -name specifies the name of the table to be loaded. The table must be accessible to the user specified in the command-line.

If ctxload is used to update/export a text column, -name specifies the policy for the column to be exported/updated.

If ctxload is used to import a thesaurus, -name specifies the name of the thesaurus to be imported. The thesaurus name is used to specify the thesaurus to be used for expanding query terms in queries.

Note:
The thesaurus name must be unique. If the name specified for the thesaurus is identical to the name of an existing thesaurus, ctxload returns an error and does not overwrite the existing thesaurus.

If ctxload is used to export a thesaurus, -name specifies the name of the thesaurus to be exported.

-file

If ctxload is used to load text, -file specifies the name of the load file which contains the document header markers, structured data, and text/file pointers (see the -separate argument).

If ctxload is used to update a row in a text column, -file specifies the file which stores the text to be inserted into the text column for the row specified by -pk.

If ctxload is used to export a row in a text column, -file specifies the file which stores the text to be exported from the text column for the row specified by -pk.

If ctxload is used to load a thesaurus, -file specifies the name of the import file which contains the thesaurus entries.

If ctxload is used to export a thesaurus, -file specifies the name of the export file created by ctxload.

Note:
If the name specified for the thesaurus dump file is identical to an existing file, ctxload overwrites the existing file.

-pk

Specifies the primary key for the row in which the text column (LONG or LONG RAW) to be exported/updated is located.

Note:
A value is required for -pk only when ctxload is used to update/export the contents of a text column for a row.

For tables that contain composite primary keys, enter the multiple primary key values as a string, with each primary key value separated by a comma.

Note:
For composite textkeys, the string must be entered in the same order in which the primary key columns were defined as textkeys for the policy.
If the primary key value(s) contain blank spaces, the entire value for -pk must be enclosed in double quotation marks (" ").
For example:
...-pk "3452,Joe Smith,500 Oracle Parkway"...
If the primary key values contain commas ( , ) or backslashes ( \ ), each comma/backslash must be preceded by a backslash.
For example:
...-pk "3452,Smith\, Joe"...
In this example, the comma in the second value 'Smith, Joe' is escaped with a backslash. The value also contains a blank space, so the entire primary key value is enclosed in double quotes.
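
The escaping rules for -pk can be sketched in a few lines of Python. This is only an illustration: build_pk is a hypothetical helper, not part of ctxload or ConText. It escapes backslashes and commas in each value and joins the values with commas. The enclosing double quotation marks are not added here, on the assumption that they are needed only when the string is typed on a shell command line.

```python
def build_pk(values):
    """Join primary key values into a single -pk string (hypothetical helper).

    Backslashes and commas inside a value are each preceded by a
    backslash, then the values are joined with commas.
    """
    escaped = []
    for value in values:
        text = str(value)
        # Escape backslashes first so the backslashes added for commas
        # are not escaped a second time.
        text = text.replace("\\", "\\\\").replace(",", "\\,")
        escaped.append(text)
    return ",".join(escaped)


print(build_pk([3452, "Smith, Joe"]))  # prints 3452,Smith\, Joe
```

The values must still be supplied in the order in which the primary key columns were defined as textkeys for the policy; the helper does not check this.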

Optional Arguments

-export

Specifies that ctxload exports the contents of a cell in a database table into the operating system file specified by -file. The cell is identified as the LONG RAW or LONG column for the row specified by -pk in the table for the policy specified by -name.

Note:
If the file specified by -file already exists, ctxload overwrites the contents of the file with the contents of the LONG/LONG RAW column.

-update

Specifies that ctxload updates the contents of a cell in a database table with the contents of the operating system file specified by -file. The cell is identified as the LONG RAW or LONG column for the row specified by -pk in the table for the policy specified by -name.

-thes

Specifies that ctxload imports a thesaurus. The file from which it loads the thesaurus is specified by the -file argument. The name of the thesaurus to be imported is specified by the -name argument.

-thescase

Specifies whether ctxload creates a case-sensitive thesaurus with the name specified by -name and populates the thesaurus with entries from the thesaurus import file specified by -file. If -thescase is 'y' (the thesaurus is case-sensitive), ConText enters the terms in the thesaurus exactly as they appear in the import file.

The default for -thescase is 'n' (case-insensitive thesaurus).

Note:
-thescase is only valid for use with the -thes argument.

-thesdump

Specifies that ctxload exports a thesaurus. The name of the thesaurus to be exported is specified by the -name argument. The file into which the thesaurus is dumped is specified by the -file argument.

-separate

For text loading, specifies that the text of each document in the load file is actually a pointer to a separate text file. During processing, ctxload loads the contents of each text file into the LONG or LONG RAW column for the specified row.

-longsize

For text loading, specifies the maximum number of kilobytes to be loaded into the LONG or LONG RAW column. This argument may be necessary for loading separate data and to help reduce memory usage when loading smaller embedded data.

The minimum value is 1 (Kb, i.e. 1024 bytes) and the maximum value is machine-dependent. The default is 64 (Kb).

Note:
The value for -longsize must be entered as a numeric value. Do not include a 'K' or 'k' to indicate Kilobytes.
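
As a rough illustration of choosing a value for -longsize, the following Python sketch converts a size in bytes to a whole number of kilobytes. The function name longsize_kb is hypothetical, and the sketch assumes that the value should cover the largest document being loaded, which this chapter does not state explicitly.

```python
import math


def longsize_kb(largest_document_bytes):
    """Return a whole number of kilobytes that covers the given size."""
    if largest_document_bytes < 0:
        raise ValueError("size must be non-negative")
    # -longsize has a documented minimum of 1 and takes no 'K' suffix,
    # so the result is a plain integer of at least 1.
    return max(1, math.ceil(largest_document_bytes / 1024))


print(longsize_kb(200000))  # prints 196
```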

-date

Specifies the TO_CHAR date format for any date columns loaded using ctxload.

See Also:
For more information about the available date format models, see Oracle8 SQL Reference.

-log

Specifies the name of the log file to which ctxload writes any national-language supported (NLS) messages generated during processing. If you do not specify a log file name, the messages appear on the standard output.

-trace

Specifies that a server process trace file is enabled using 'ALTER SESSION SET SQL_TRACE TRUE'. This command captures all processed SQL statements in a trace file, which can be used for debugging purposes. The location of the trace file is operating-system dependent and may be modified using the USER_DUMP_DEST initialization parameter.

-commitafter

Specifies the number of rows (documents) that are inserted into the table before a commit is issued to the database. The default is 1.

-verbose

Specifies that non-NLS messages can appear on standard output.

Usage Notes

The following conditions apply to the command-line syntax:

if you do not specify -thes or -thesdump, by default ctxload loads text into the specified table.
for text loading, you do not need to specify a column name because ctxload automatically loads text to the LONG or LONG RAW column in a table and a table can contain only one such column
if you use embedded text instead of separate file pointers in the text load file, do not use the -separate option
loading text from separate files (using the -separate option) is faster, in general, than loading text embedded in the load file

Command-line Examples

This section provides examples for each of the operations that ctxload can perform:

Text Load Example

The following example loads documents from the reviews.txt load file into table docs for user jsmith. It also writes log information to a file called log2.out. Because -commitafter was not specified, each row (document) is committed to the database after it is inserted into the docs table.

Also, because -separate was not specified, ctxload expects the text for each document to be embedded in the reviews.txt file.

ctxload -user jsmith/123abc -name docs -file reviews.txt -log log2.out

Document Update Example

The following UNIX-based example illustrates updating the LONG RAW column for the row identified by primary key 3452 in the table for a policy named word_docs. The column is updated with the contents of resume1.doc located in /docs:

ctxload -user ctxdemo/passwd -update -name word_docs -pk 3452 -file /docs/resume1.doc

Document Export Examples

The following UNIX-based example illustrates exporting the LONG RAW column for the row identified by primary key 3452 in the table for a policy named word_docs. The contents of the cell in the column are copied to a file named new.doc located in /docs:

ctxload -user ctxdemo/passwd -export -name word_docs -pk 3452 -file /docs/new.doc

The following example is identical to the preceding example, except the row is identified by a compound primary key consisting of a name and location. The name and location values are separated by a comma and the entire primary key string is enclosed in double quotation marks because the location value includes a space:

ctxload -user ctxdemo/passwd -export -name word_docs -pk "Smith,HQ 1" -file /docs/new.doc

Thesaurus Import Example

The following example imports a thesaurus named tech_doc from an import file named tech_thesaurus.txt:

ctxload -user jsmith/123abc -thes -name tech_doc -file tech_thesaurus.txt

Thesaurus Export Example

The following example dumps the contents of a thesaurus named tech_doc into a file named tech_thesaurus.out:

ctxload -user jsmith/123abc -thesdump -name tech_doc -file tech_thesaurus.out
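
When ctxload is driven from a script, the command lines shown in these examples can be assembled programmatically. The following Python sketch is a hypothetical wrapper (ctxload_args and its mode names are not part of ConText). It uses only arguments documented in this chapter and mirrors the argument order of the examples; whether ctxload depends on that order is not stated here.

```python
def ctxload_args(user, name, file_name, mode="load", pk=None,
                 separate=False, log=None, commit_after=None):
    """Build a ctxload argument list from the documented arguments."""
    mode_flags = {
        "load": [],                      # default: text loading
        "update": ["-update"],
        "export": ["-export"],
        "thes_import": ["-thes"],
        "thes_export": ["-thesdump"],
    }
    if mode not in mode_flags:
        raise ValueError("unknown mode: " + mode)
    if mode in ("update", "export") and pk is None:
        raise ValueError("-pk is required for update and export")
    if separate and mode != "load":
        raise ValueError("-separate applies to text loading only")

    args = ["ctxload", "-user", user] + mode_flags[mode]
    args += ["-name", name]
    if pk is not None:
        args += ["-pk", str(pk)]
    args += ["-file", file_name]
    if separate:
        args.append("-separate")
    if commit_after is not None:
        args += ["-commitafter", str(commit_after)]
    if log is not None:
        args += ["-log", log]
    return args


print(ctxload_args("ctxdemo/passwd", "word_docs", "/docs/new.doc",
                   mode="export", pk="Smith,HQ 1"))
```

The resulting list could be passed to subprocess.run, in which case each element is delivered as one argument and no shell quoting of the -pk value is involved.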

Structure of Text Load File

The load file must use the following format for each document, as well as any structured data associated with the document:

<TEXTSTART: col_name1=doc_data, col_name2=doc_data,...col_nameN=doc_data>
text. . . 
<TEXTEND>

where:

<TEXTSTART: ... >

is a header marker that indicates the beginning of a document. It also may contain one or more of the following fields used to specify structured data for a document:

col_name

is the name of a column that will store structured data for the document.

doc_data

is the structured data that will be stored in the column specified in col_name.

text

is the text of the document to be loaded or the name (and location, if necessary) of an operating system file containing the text to be loaded.

Note:
The data in text (either a string of text or a file name pointer) is treated by ctxload as literal data, including any non-alphanumeric characters or blank spaces that may occur. As a result, you must ensure that text exactly represents the data you wish ctxload to process.
For example, if you use ctxload to load text from separate files, the file names in the load file must exactly represent the name(s) of the operating-system file(s) containing the text. If any blank spaces are included in a file name, ctxload cannot locate the file and the text is not loaded.

<TEXTEND>

indicates the end of the document.

Load File Structure

The following conditions apply to the structure of the load file:

for each document to be loaded, either the text of the document or a pointer to a separate file must be in the load file.
embedded text and separate file pointers cannot be used together in the same load file
if the text for your documents is embedded in the load file, the text must be in ASCII format
if pointers to separate files are used, the text in the files can be in plain (ASCII) format or a proprietary format (e.g. MS Word)
if the text in a separate file is in a proprietary format, the format must be supported by ConText and it must be loaded into a LONG RAW column
each separate file must contain a single document (the contents of a separate file are stored as a single row in the table)

Load File Syntax

The following conditions apply to the syntax utilized in the text load file:

<TEXTSTART: ... > and <TEXTEND> must each start on a new line
the structured data parameters within the <TEXTSTART: ... > string do not have to be in any particular order
a newline character (either hard or soft return) cannot occur between a col_name and the beginning of its associated doc_data

Note:
The entire value for doc_data does not have to be on the same line as the col_name; only the beginning of the value and the col_name must share the same line.
the first col_name should be on the same line as the 'TEXTSTART:'
the '>' character which indicates the end of the <TEXTSTART: ... > string must be on the same line as the last doc_data field for the document
structured and LONG data may span more than one line
single quote-marks must be escaped in doc_data (e.g. don't must be entered as don''t)
each <TEXTSTART: ... > string must be followed by the text of a document or a pointer to a separate file
the text or file pointer must be placed after the complete <TEXTSTART: ... > string and should start on a new line
the last character in the load file should be a newline character
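
Some of the syntax conditions above can be checked before a load file is passed to ctxload. The following Python sketch is a hypothetical pre-check, not a ConText tool. It verifies only that the markers start on their own lines, that each <TEXTSTART: ... > is closed by a <TEXTEND>, and that the file ends with a newline. It does not validate the structured data fields, and it would flag document text that happens to contain a marker string in the middle of a line.

```python
def check_load_file(text):
    """Return a list of problems found in load-file text (empty if none)."""
    problems = []
    if not text.endswith("\n"):
        problems.append("file does not end with a newline")
    inside = False  # True while between <TEXTSTART: and <TEXTEND>
    for number, line in enumerate(text.splitlines(), start=1):
        if line.startswith("<TEXTSTART:"):
            if inside:
                problems.append("line %d: <TEXTSTART: before <TEXTEND>" % number)
            inside = True
        elif line.startswith("<TEXTEND>"):
            if not inside:
                problems.append("line %d: <TEXTEND> without <TEXTSTART:" % number)
            inside = False
        elif "<TEXTSTART:" in line or "<TEXTEND>" in line:
            problems.append("line %d: marker does not start the line" % number)
    if inside:
        problems.append("missing final <TEXTEND>")
    return problems


sample = "<TEXTSTART: EMPNO=1000>\nsome text\n<TEXTEND>\n"
print(check_load_file(sample))  # prints []
```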

Example of Embedded Text in Load File

The following example illustrates a correctly formatted text load file containing structured employee information, such as employee number (1000, 1024) and name (Joe Smith, Mary Jones), and the text for each document:

<TEXTSTART: EMPNO=1000, ELNAME='Smith', EFNAME='Joe'>
Joe has an interesting resume, includes...cliff-diving.
<TEXTEND>
<TEXTSTART: EMPNO=1024, EFNAME='Mary', ELNAME='Jones'>
Mary has many excellent skills, including...technical,
marketing, and organizational.  Team player.
<TEXTEND>
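
A load file like the one above could be generated from application data. The following Python sketch is one possible approach; format_value and format_record are hypothetical helpers. It assumes, based only on the example above, that numeric values are written bare and character values are enclosed in single quotes, and it doubles any single quote inside a value as the syntax conditions require.

```python
def format_value(value):
    """Render one doc_data value: numbers bare, strings single-quoted."""
    if isinstance(value, (int, float)):
        return str(value)
    # Single quotes inside doc_data are doubled (don't becomes don''t).
    return "'" + str(value).replace("'", "''") + "'"


def format_record(columns, text):
    """Render one document as a <TEXTSTART: ...> ... <TEXTEND> block."""
    fields = ", ".join(
        "%s=%s" % (name, format_value(value)) for name, value in columns
    )
    # The text starts on a new line and <TEXTEND> starts on its own line.
    body = text if text.endswith("\n") else text + "\n"
    return "<TEXTSTART: %s>\n%s<TEXTEND>\n" % (fields, body)


record = format_record(
    [("EMPNO", 1000), ("ELNAME", "Smith"), ("EFNAME", "Joe")],
    "Joe has an interesting resume, includes...cliff-diving.",
)
print(record, end="")
```

The same helper can write file name pointers instead of embedded text by passing the file name as the text argument, provided every record in the file uses pointers.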

Example of File Name Pointers in Load File

<TEXTSTART: EMPNO=1024, EFNAME='Mary', ELNAME='Jones'>
mjones.doc
<TEXTEND>
<TEXTSTART: EMPNO=1000, EFNAME='Joe', ELNAME='Smith'>
jsmith.doc
<TEXTEND>

Note:
To use the load file in this example, you would have to specify the -separate argument when executing ctxload.

Structure of Thesaurus Import File

The import file must use the following format for entries in the thesaurus:

phrase
BT broader_term
NT narrower_term1
NT narrower_term2
. . .
NT narrower_termN

BTG broader_term
NTG narrower_term1
NTG narrower_term2
. . .
NTG narrower_termN

BTP broader_term
NTP narrower_term1
NTP narrower_term2
. . .
NTP narrower_termN

BTI broader_term
NTI narrower_term1
NTI narrower_term2
. . .
NTI narrower_termN

SYN synonym1
SYN synonym2
. . .
SYN synonymN
USE|SEE synonym1

RT related_term1
RT related_term2
. . .
RT related_termN

SN text

where:

phrase

is a word or phrase that is defined as having synonyms, broader terms, narrower terms, and/or related terms.

In compliance with ISO-2788 standards, a TT marker can be placed before a phrase to indicate that the phrase is the top term in a hierarchy; however, the TT marker is not required. In fact, ctxload ignores TT markers during import.

In ConText, a top term is identified as any phrase that does not have a broader term (BT, BTG, BTP, or BTI).

Note:
The thesaurus query operators (SYN, PT, BT, BTG, BTP, BTI, NT, NTG, NTP, NTI, and RT) are reserved words and, thus, cannot be used as phrases in thesaurus entries.
In addition, the string 'E$_' is reserved for internal use and cannot be used as a phrase in thesaurus entries.

BT, BTG, BTP, BTI

are the markers that indicate broader_termN is a broader (generic|partitive|instance) term for phrase.

NT, NTG, NTP, NTI

are the markers that indicate narrower_termN is a narrower (generic|partitive|instance) term for phrase.

If phrase does not have a broader (generic|partitive|instance) term, but has one or more narrower (generic|partitive|instance) terms, phrase is created as a top term in the respective hierarchy (in a ConText thesaurus, the BT/NT, BTG/NTG, BTP/NTP, and BTI/NTI hierarchies are separate structures).

SYN

is a marker that indicates phrase and synonymN are synonyms within a synonym ring.

Note:
Synonym rings are not defined explicitly in ConText thesauri. They are created by the transitive nature of synonyms.

USE | SEE

are markers that indicate phrase and synonymN are synonyms within a synonym ring (similar to SYN); however, USE | SEE also indicates synonymN is the preferred term for the synonym ring. Either marker can be used to define the preferred term for a synonym ring.

RT

is the marker that indicates related_termN is a related term for phrase.

SN

is the marker that indicates the following text is a scope note (i.e. comment) for the preceding entry.

broader_termN

is a word or phrase that conceptually provides a more general description or category for phrase. For example, the word elephant could have a broader term of land mammal.

narrower_termN

is a word or phrase that conceptually provides a more specific description for phrase. For example, the word elephant could have narrower terms of Indian elephant and African elephant.

synonymN

is a word or phrase that has the same meaning as phrase. For example, the word elephant could have a synonym of pachyderm.

related_termN

is a word or phrase that has a meaning related to, but not necessarily synonymous with, phrase. For example, the word elephant could have a related term of woolly mammoth.

Note:
Related terms are not transitive. If a phrase has two or more related terms, the terms are related only to the parent phrase and not to each other.

Alternate Hierarchy Structure

In compliance with thesauri standards, the import file supports formatting hierarchies (BT/NT, BTG/NTG, BTP/NTP, BTI/NTI) by indenting the terms under the top term and using NT (or NTG, NTP, NTI) markers that indicate the level for the term:

phrase
   NT1 narrower_term1
      NT2 narrower_term1.1
      NT2 narrower_term1.2
          NT3 narrower_term1.2.1
          NT3 narrower_term1.2.2
   NT1 narrower_term2
   . . .
   NT1 narrower_termN

Using this method, the entire branch for a top term can be represented hierarchically in the import file.
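
The level-numbered layout can be produced from a nested data structure. The following Python sketch is illustrative only: format_hierarchy is a hypothetical helper, and the three-space indent per level is an assumption taken from the sample layout, since this chapter does not say whether the amount of indentation is significant to ctxload.

```python
def format_hierarchy(top_term, children, marker="NT", indent="   "):
    """Render a top term and its nested narrower terms with level markers.

    children maps each narrower term to a dict of its own narrower terms.
    """
    lines = [top_term]

    def walk(subtree, level):
        for term, narrower in subtree.items():
            # The marker carries the level number: NT1, NT2, NT3, ...
            lines.append("%s%s%d %s" % (indent * level, marker, level, term))
            walk(narrower, level + 1)

    walk(children, 1)
    return "\n".join(lines) + "\n"


tree = {"mammal": {"cat": {"domestic cat": {}}, "dog": {}}}
print(format_hierarchy("animal", tree), end="")
```

Passing marker="NTG", "NTP", or "NTI" produces the corresponding generic, partitive, or instance hierarchy in the same layout.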

Import File Structure for Terms

The following conditions apply to the structure of the entries in the import file:

each entry (phrase, BT, NT, or SYN) must be on a single line followed by a newline character
entries can consist of a single word or phrases
the maximum length of an entry (phrase, BT, NT, or SYN) is 255 characters, not including the BT, NT, and SYN markers or the newline characters
entries cannot contain parentheses or plus signs.
each line of the file that does not start with a BT, NT, or SYN marker indicates a phrase
a phrase can occur more than once in the file
each phrase can have one or more narrower term entries (NT, NTG, NTP, NTI), broader term entries (BT, BTG, BTP, BTI), synonym entries, and related term entries
each broader term, narrower term, synonym, and preferred term entry must start with the appropriate marker and the markers must be in capital letters
the broader terms, narrower terms, and synonyms for a phrase can be in any order
homographs must be followed by parenthetical disambiguators everywhere they are used

For example: cranes (birds), cranes (lifting equipment)
compound terms are signified by a plus sign between each factor (e.g. buildings + construction)
compound terms are allowed only as synonyms or preferred terms for other terms -- never as terms by themselves, or in hierarchical relations.
terms can be followed by a scope note (SN), total maximum length of 2000 characters, on subsequent lines

multi-line scope notes are allowed, but require an SN marker on each line of the note

Example of Incorrect SN usage:

VIEW CAMERAS
SN Cameras with through-the-lens focusing and a
range of movements of the lens plane relative to
the film plane

Example of Correct SN usage:

VIEW CAMERAS
SN Cameras with through-the-lens focusing and a
SN range of movements of the lens plane relative
SN to the film plane

Multi-word terms cannot start with reserved words (e.g. use is a reserved word, so use other door is not an allowed term; however, use is an allowed term)
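
The rule that every line of a multi-line scope note needs its own SN marker is easy to apply mechanically. The following Python sketch wraps a note and prefixes each line with the marker. The helper name format_scope_note and the line width are arbitrary choices, and the sketch assumes that the 2000-character limit applies to the note text without the markers, which this chapter does not specify.

```python
import textwrap

MAX_SCOPE_NOTE = 2000  # documented maximum total length of a scope note


def format_scope_note(note, width=60):
    """Wrap a scope note so that every line starts with the SN marker."""
    if len(note) > MAX_SCOPE_NOTE:
        raise ValueError("scope note exceeds %d characters" % MAX_SCOPE_NOTE)
    # break_on_hyphens=False keeps hyphenated words on one line.
    wrapped = textwrap.wrap(note, width=width, break_on_hyphens=False)
    return ["SN " + line for line in wrapped]


for line in format_scope_note("A short note that fits on one line."):
    print(line)  # prints SN A short note that fits on one line.
```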

Import File Structure for Relationships

The following conditions apply to the relationships defined for the entries in the import file:

related term entries must follow a phrase or another related term entry
related term entries start with the RT marker, followed by white space, then the related term on the same line

multiple related terms require multiple RT markers

Example of incorrect RT usage:

MOVING PICTURE CAMERAS
RT CINE CAMERAS 
TELEVISION CAMERAS

Example of correct RT usage:

MOVING PICTURE CAMERAS
RT CINE CAMERAS
RT TELEVISION CAMERAS

Terms are allowed to have multiple broader terms, narrower terms, and related terms

Examples of Import Files

This section provides three examples of correctly formatted thesaurus import files.

Example 1 (Flat Structure)

cat
SYN feline
NT domestic cat
NT wild cat
BT mammal
mammal
BT animal
domestic cat
NT Persian cat
NT Siamese cat
wild cat
NT tiger
tiger
NT Bengal tiger
dog
BT mammal
NT domestic dog
NT wild dog
SYN canine
domestic dog
NT German Shepherd
wild dog
NT Dingo

Example 2 (Hierarchical)

animal
   NT1 mammal
        NT2 cat
             NT3 domestic cat
                  NT4 Persian cat
                  NT4 Siamese cat
             NT3 wild cat
                  NT4 tiger
                       NT5 Bengal tiger
        NT2 dog
             NT3 domestic dog
                  NT4 German Shepherd
             NT3 wild dog
                  NT4 Dingo
cat
SYN feline
dog
SYN canine

Example 3

35MM CAMERAS
BT MINIATURE CAMERAS
CAMERAS
BT OPTICAL EQUIPMENT
NT MOVING PICTURE CAMERAS
NT STEREO CAMERAS
LAND CAMERAS
USE VIEW CAMERAS
VIEW CAMERAS
SN Cameras with through-the-lens focusing and a range of
SN movements of the lens plane relative to the film plane
UF LAND CAMERAS
BT STILL CAMERAS
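
To inspect a flat import file such as Examples 1 and 3 before loading it, the entries can be read into a simple structure. The following Python sketch is a hypothetical reader, not the parser used by ctxload. It treats a line as a relationship when its first word is an upper-case marker followed by a term, and as a phrase otherwise. The UF marker is included only because it appears in Example 3; it is not described in the format above. The level-numbered markers of the hierarchical format (NT1, NT2, and so on) are not handled.

```python
MARKERS = {"BT", "BTG", "BTP", "BTI", "NT", "NTG", "NTP", "NTI",
           "SYN", "USE", "SEE", "UF", "RT", "SN"}


def parse_import_file(text):
    """Map each phrase to a list of (marker, value) pairs, in file order."""
    entries = {}
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        parts = line.split(None, 1)
        marker = parts[0]
        rest = parts[1].strip() if len(parts) > 1 else ""
        if marker in MARKERS and rest:
            if current is None:
                raise ValueError("relationship before any phrase: " + line)
            entries[current].append((marker, rest))
        else:
            # Any other line names a phrase; a phrase may occur more than once.
            current = line
            entries.setdefault(current, [])
    return entries


sample = "cat\nSYN feline\nNT domestic cat\nBT mammal\nmammal\nBT animal\n"
print(parse_import_file(sample)["mammal"])  # prints [('BT', 'animal')]
```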
