Manipulating Strings with ObjectPAL

Contributed by Al Breveleri
22 January 2002

Manipulating Strings with ObjectPAL © 2002 Al Breveleri

Previous Section: Part 1: General Considerations and Part 2: Searching In A String

3. Parsing Grammatically

3.1. Importing Delimited Text Data

Delimited text data is sometimes preferred over fixed column text data because the records can vary in length, which usually makes the file shorter.

Data is expressed as text, with numeric values written as their text representations.

The data is organized into records, which are marked with a record-ender character (or sequence). This marking allows the records to vary in length. Text data files practically always use one line per record. Otherwise the file is almost impossible to handle. All my examples assume that a line-ender is used as a record-ender.

The records are organized into fields, which are marked with a field separator character. The separator is most often a comma or a tab, but anything may be used.

A difficulty immediately arises if the data in any field happens to contain the field-separator character. For example, a comma or tab may appear in a string, or a comma may appear in a number. If this cannot be avoided, then a field delimiter character must be defined. Typically the quote (") character is used. Any text bracketed by a pair of delimiters is defined as one field of data, even if it contains separators.

Now a difficulty arises if any data happens to contain a field delimiter character. There is no good solution to this problem, but it is apparently much easier to avoid delimiters in data than to avoid separators in data. Go figure.

A delimited text data file may be specified as having no delimiter defined, as having a delimiter used to bracket all data fields, or as having a delimiter used to bracket only those fields that need it. I think it is best to be prepared for delimiters to appear around any field in any record.

3.1.1. Reading Lines from a Text File Composed of Lines

It is possible to locate the beginning and end of the next line in the textstream input and then extract the fields directly from the file. However, this is excessively complicated, as all the patterns would need to be constructed to stop at the end of a line.

Where a file naturally breaks into lines that can be treated separately (as in a typical delimited text import), it is easier to read each line into a string variable for further processing, even though this means copying the data an extra time.

You could read the entire file into a string variable and then apply the breakApart() method, but there are some fussy detail difficulties in dealing with some files using "\n" as a line ender, while others may use "\r\n". The textstream.readLine() method irons out this difference. (ReadLine() actually reads through the "\n", then discards the trailing "\n" from the input, then discards any trailing "\r".)

If you need to have all lines available when processing each line, read the file into a 'lines' string array first. If the lines are independent, then just process each line before reading the next -- no 'lines' array is necessary.

Listing 2: Reading a text file when you know none of the lines will be greater than 1023 chars long

; assuming gtsSRC has been opened globally
; as the input textstream
proc PROCESS_TEXT_1()
  var
    psLINE string
    ; ... other variables as needed to process line
  endvar
  gtsSRC'home()
  while not gtsSRC'eof()
    gtsSRC'readLine(psLINE)
    ; ...
    ; ... process line now in psLINE
    ; ...
  endwhile
endproc

Listing 3: Reading a text file when you know none of the lines will be greater than 32767 chars long

; assuming gtsSRC has been opened globally
; as the input textstream
proc PROCESS_TEXT_2()
  var
    psLINE string
    piANCHOR, piFNDBGN, piFNDEND longint
    ; ...
  endvar
  gtsSRC'home()
  piFNDEND = 1
  while true
    piANCHOR = piFNDEND
    piFNDBGN = piANCHOR
    ; At this point, the variables piANCHOR, piFNDBGN,
    ; and piFNDEND all point to the start of a line, where
    ; we want the next search to begin.
    if not gtsSRC'advMatch(piFNDBGN, piFNDEND, "(\r\n)|(\r)|(\n)") then
      ; If a line ender is found, advMatch sets piFNDBGN
      ; to point to the first char of the line ender and
      ; piFNDEND to point to the first char after it. If
      ; a line ender is not found, the next two statements
      ; set piFNDBGN and piFNDEND to point to the first
      ; char after the end of the file.
      piFNDBGN = size(gtsSRC)+1
      piFNDEND = piFNDBGN
    endif
    ; Now, piANCHOR, piFNDBGN, and piFNDEND can be
    ; compared to determine what was found, and what
    ; action should be taken:
    ; case                      piFNDBGN        piFNDEND        action
    ; ------------------------  --------------  --------------  ------------------
    ; no text, no line ender    = piANCHOR      = piFNDBGN      quit (end of file)
    ; no text, but line ender   = piANCHOR      next line bgn   process empty line
    ; text but no line ender    curr line end   = piFNDBGN      process line
    ; text and line ender       curr line end   next line bgn   process line
    ; ------------------------  --------------  --------------  ------------------
    ; quit if piFNDBGN=piANCHOR and piFNDEND=piFNDBGN
    if piFNDEND=piANCHOR then quitloop endif
    if piFNDBGN=piANCHOR then
      psLINE = blank()
    else
      gtsSRC'setPosition(piANCHOR)
      gtsSRC'readChars(psLINE,piFNDBGN-piANCHOR)
    endif
    ; ...
    ; ... process line now in psLINE
    ; ...
  endwhile
endproc

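The four-case table in Listing 3 is easier to check when the same scan runs against an in-memory string. What follows is a rough sketch in Python, not ObjectPAL: the function name `split_lines` is invented for this illustration, positions are 0-based rather than 1-based, and Python's `re.search` merely stands in for `advMatch()`.

```python
import re

# Same three alternatives as the advMatch() pattern in Listing 3.
LINE_ENDER = re.compile(r"\r\n|\r|\n")

def split_lines(text):
    """Split text into lines, accepting "\r\n", "\r" or "\n" as line enders."""
    lines = []
    fnd_end = 0
    while True:
        anchor = fnd_end                      # start of the next line
        m = LINE_ENDER.search(text, anchor)
        if m:
            fnd_bgn, fnd_end = m.start(), m.end()
        else:
            # No line ender: point both just past the end of the text.
            fnd_bgn = fnd_end = len(text)
        if fnd_end == anchor:                 # no text, no line ender
            break
        lines.append(text[anchor:fnd_bgn])    # empty string for an empty line
    return lines
```

A final line without a line ender is still returned, and a trailing line ender does not produce an extra empty line, which mirrors the last two rows and the first row of the table.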
Following is a modification of Listing 3 to cope with the fact that strings can be 2GB long but the textstream.readChars() method is restricted to 32KB per read.

Listing 4: Reading a text file in the general case, when you hope none of the lines will be greater than 2147483647 chars long

; assuming gtsSRC has been opened globally
; as the input textstream
proc PROCESS_TEXT_3()
  var
    psLINE, psBFFR string
    piANCHOR, piFNDBGN, piFNDEND longint
    piREMAINING longint
    ; ...
  endvar
  gtsSRC'home()
  piFNDEND = 1
  while true
    piANCHOR = piFNDEND
    piFNDBGN = piANCHOR
    if not gtsSRC'advMatch(piFNDBGN,piFNDEND, "(\r\n)|(\r)|(\n)") then
      piFNDBGN = size(gtsSRC)+1
      piFNDEND = piFNDBGN
    endif
    ; case                      piFNDBGN        piFNDEND        action
    ; ------------------------  --------------  --------------  ------------------
    ; no text, no line ender    = piANCHOR      = piFNDBGN      quit (end of file)
    ; no text, but line ender   = piANCHOR      next line bgn   process empty line
    ; text but no line ender    curr line end   = piFNDBGN      process line
    ; text and line ender       curr line end   next line bgn   process line
    ; ------------------------  --------------  --------------  ------------------
    if piFNDEND=piANCHOR then quitloop endif
    if piFNDBGN=piANCHOR then
      psLINE = blank()
    else
      gtsSRC'setPosition(piANCHOR)
      ; The textstream.readChars() method is restricted to
      ; 32767 chars per read. When there is no guarantee
      ; that all input lines will be shorter than that, we
      ; need to use a loop to read in the line 32767 chars
      ; at a time.
      piREMAINING = piFNDBGN-piANCHOR
      psLINE = blank()
      while piREMAINING>0
        gtsSRC'readChars(psBFFR,int(min(piREMAINING,32767)))
        psLINE = psLINE + psBFFR
        piREMAINING = piREMAINING-32767
      endwhile
    endif
    ; ...
    ; ... process line now in psLINE
    ; ...
  endwhile
endproc

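The inner loop of Listing 4 is a general pattern: read a known number of characters through an interface that caps each read. As an illustration only, here is the same loop in Python rather than ObjectPAL. The helper name `read_exact` and the use of `io.StringIO` as a stand-in for the textstream are assumptions of this sketch; the 32767 cap is the figure quoted above for readChars().

```python
import io

CHUNK_LIMIT = 32767  # per-read cap quoted in the text for readChars()

def read_exact(stream, count, chunk_limit=CHUNK_LIMIT):
    """Read `count` chars from `stream`, never asking for more than chunk_limit at once."""
    pieces = []
    remaining = count
    while remaining > 0:
        piece = stream.read(min(remaining, chunk_limit))
        if not piece:
            break                    # stream ended before `count` chars were read
        pieces.append(piece)
        remaining -= len(piece)      # count what was actually delivered
    return "".join(pieces)
```

Unlike Listing 4, this sketch subtracts the number of characters actually returned instead of a fixed 32767, so it also stops cleanly if the input ends early.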
3.1.2. Breaking the Fields Out of a Record Line

If the data has separators but no delimiters, then the separator character cannot appear in any data. The breakApart() method will extract the fields without further sophistication. This is a special case that doesn't come up often in general data entry, but can be reliable when you also control the export that produces the data.

Listing 5: Extracting fields from a record with separators and no delimiters. This code snippet is a candidate replacement for the 'process line now in psLINE' section above.

; assuming separator is in gsSEP
proc PROCESS_TEXT_2()
  var
    ; ...
    pasFIELDS array [] string
    II longint
    ; ... other variables as needed to process line
  endvar
  ; ...
  ; ... get next line into psLINE
  ; ...
  ; If the subject string ends with a separator
  ; character, breakApart() does not generate a
  ; corresponding final element after the separator.
  ; By appending a separator character to the end of the
  ; subject string, we force a final element. If the
  ; subject string does not end with a separator
  ; character, appending a separator has no effect.
  breakApart(psLINE+gsSEP,pasFIELDS,gsSEP)
  for II from 1 to size(pasFIELDS)
    ; ...
    ; ... next datum is pasFIELDS[II]
    ; ...
  endfor
endproc

Generally, 'delimited text data' has both delimiters and separators. The whole point of using delimiters is so separators can appear in string field data. Separator characters within delimited fields must be ignored. This means that the delimiters must be located first. Here is a technique using the string.breakApart() method.

Listing 6: Extracting fields from a record in the general case, with both separators and delimiters. This code snippet is a candidate replacement for the 'process line now in psLINE' section above.

; assuming separator is in gsSEP and delimiter in gsDLM
proc PROCESS_TEXT_3()
  var
    ; ...
    pasTOKENS, pasFIELDS array [] string
    II, JJ longint
    ; ... other variables as needed to process line
  endvar
  ; ...
  ; ... get next line into psLINE
  ; ...
  psLINE'breakApart(pasTOKENS,gsDLM)
  ; Now, even numbered items in pasTOKENS were inside
  ; delimited fields, and odd numbered items were
  ; everything between delimited fields.
  for II from 1 to size(pasTOKENS) step 2
    ; process text outside quotes from pasTOKENS[II]
    breakApart(pasTOKENS[II]+gsSEP,pasFIELDS,gsSEP)
    for JJ from iif(II=1,1,2) to iif(II=size(pasTOKENS), size(pasFIELDS), size(pasFIELDS)-1)
      ; ...
      ; ... next datum is pasFIELDS[JJ] (not delimited)
      ; ...
    endfor
    ; Check for an odd number of items in pasTOKENS --
    ; this happens whenever the last field in a record is
    ; not delimited.
    if II<>size(pasTOKENS) then
      ; process text inside quotes
      ; from pasTOKENS[II+1]
      ; ...
      ; ... next datum is pasTOKENS[II+1] (was delimited)
      ; ...
    endif
  endfor
endproc

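The delimiters-first idea of Listing 6 can be tried out on sample records with a few lines of Python. This is a sketch of the technique, not a translation of the ObjectPAL: `parse_record` is a made-up name, indexes are 0-based (so the odd-numbered pieces are the delimited ones), and Python's `str.split()` keeps empty pieces, so trimming the first and last piece takes the place of the iif() loop bounds. Whether breakApart() treats empty pieces the same way is not assumed here.

```python
def parse_record(line, sep=",", dlm='"'):
    """Split one record into fields; text between delimiters is kept whole."""
    tokens = line.split(dlm)     # odd indexes (0-based) were inside delimiters
    last = len(tokens) - 1
    fields = []
    for i, token in enumerate(tokens):
        if i % 2 == 1:
            fields.append(token)             # delimited field, separators ignored
            continue
        parts = token.split(sep)
        if i > 0:
            parts = parts[1:]    # piece before the separator that follows a delimited field
        if i < last:
            parts = parts[:-1]   # piece after the separator that precedes a delimited field
        fields.extend(parts)
    return fields
```

For example, `parse_record('a,"b,c",d')` yields the three fields `a`, `b,c` and `d`. As in the article, a delimiter character inside the data itself is not handled.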
3.2. Finding SGML Tags in a Text File

Here, as opposed to the delimited text data case, it's a waste of time to consider the file in terms of lines. Even a single tag may cross a line boundary. It is best to search the text file for the tag location, then use textstream.setPosition() and textstream.readChars() to extract the tag.

3.2.1. Finding a Single Tag

Listing 7: Find the next '<XXX ...>' tag in an opened textstream after the current position.

; assuming gtsSRC has been opened globally
; as the input textstream
proc FIND_TAG()
  var
    ; ...
    psTAGSTR string ; tag will be copied to this variable
    piBGNPSN, piENDPSN longint
  endvar
  ; ...
  ; start searching at the current position
  piBGNPSN = gtsSRC'position()
  if gtsSRC'advMatch(piBGNPSN,piENDPSN, "<XXX([ \t\r\n]+[^>]*)?>") then
    gtsSRC'setPosition(piBGNPSN)
    gtsSRC'readChars(psTAGSTR,piENDPSN-piBGNPSN)
    ; tag with attributes is now in psTAGSTR
    ; current file position is now first char after the tag
  else
    ; tag not found
  endif
endproc

The pattern is intended to match the 'XXX' tag with or without attributes. Here is how the pattern was built up.
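Read left to right, the pattern has these parts. `<XXX` matches the start of the tag literally. `[ \t\r\n]+` requires at least one space, tab, or line-ender character, so a longer tag name such as `<XXXY>` is not mistaken for `XXX`. `[^>]*` then accepts any attribute text up to the closing bracket. The parentheses with the trailing `?` make that whole whitespace-plus-attributes part optional, so a bare `<XXX>` also matches. The final `>` closes the tag.

One quick way to experiment with the pattern is Python's `re` module, on the assumption (not confirmed by this article) that advMatch() interprets these operators the way an ordinary regular-expression engine does. The helper name `find_tag` is invented for this sketch, and positions are 0-based.

```python
import re

# Same pattern text as in Listing 7.
TAG = re.compile(r"<XXX([ \t\r\n]+[^>]*)?>")

def find_tag(text, start=0):
    """Return (begin, end, tag_text) of the next XXX tag at or after start, or None."""
    m = TAG.search(text, start)
    if m is None:
        return None
    return m.start(), m.end(), m.group(0)
```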
Listing 8: Find the next '<XXX ...>...</XXX>' tag pair in an opened textstream after the current position.

; assuming gtsSRC has been opened globally
; as the input textstream
proc FIND_TAG_PAIR()
  var
    ; ...
    psTAGSTR string ; gets tag pair and all enclosed text
    piBGNPSN, piTMPPSN, piENDPSN longint
  endvar
  ; ...
  ; start searching at the current position
  piBGNPSN = gtsSRC'position()
  if gtsSRC'advMatch(piBGNPSN,piTMPPSN, "<XXX([ \t\r\n]+[^>]*)?>") then
    ; continue from the end of the opening tag
    if gtsSRC'advMatch(piTMPPSN,piENDPSN,"</XXX>") then
      gtsSRC'setPosition(piBGNPSN)
      gtsSRC'readChars(psTAGSTR,piENDPSN-piBGNPSN)
      ; tag pair with attributes and enclosed text is
      ; now in psTAGSTR
      ; current file position is now first char after
      ; the closing tag
    else
      ; opening tag found, but no closing tag
    endif
  else
    ; opening tag not found
  endif
endproc

This proc will work properly only if tags of the specified name are never nested. When the proc encounters nested tag pairs, it incorrectly matches the next opening tag found and the next closing tag found, because those are the first it sees.

3.2.3. Finding Balanced Tag Pairs

The easiest way I know to find balanced nested tag pairs is to find all the opening and closing tags first and list them by location. Furthermore, this is the fastest way I know of to accomplish this task. Okay, it's the only way I know how to do it. It's probably close to the best technique, though.

Listing 9: Here's how to use a dynarray to find the first balanced '<XXX...>...</XXX>' tag pair after the current file position when tags of this type may be nested.

; assuming gtsSRC has been opened globally
; as the input textstream
proc FIND_BALANCED_TAG_PAIR()
  var
    ; ...
    pdsTAGS dynarray [] string
    psBFFR, psTAGTXT string
    piANCHOR, piFNDBGN, piFNDEND longint
    piLEVEL longint
  endvar
  ; ...
  ; preclear the list of opening and closing tags
  pdsTAGS'empty()
  ; record the search start position
  ; (current read position in this example)
  piANCHOR = gtsSRC'position()
  ; Find all the opening tags of the specified name.
  piFNDEND = piANCHOR
  while true
    piFNDBGN = piFNDEND
    ; Construction of the pattern is described
    ; in section 3.2.1.
    if not gtsSRC'advMatch(piFNDBGN,piFNDEND, "<XXX([ \t\r\n]+[^>]*)?>") then
      quitloop
    endif
    ; Record the opening tag text, without the '<' and '>'.
    gtsSRC'setPosition(piFNDBGN+1)
    gtsSRC'readChars(psBFFR,piFNDEND-piFNDBGN-2)
    ; Use the tag beginning location as
    ; the dynarray index for this entry.
    pdsTAGS[format("w10,ez",piFNDBGN)] = psBFFR
  endwhile
  ; Find all the closing tags of the specified name.
  piFNDEND = piANCHOR
  while true
    piFNDBGN = piFNDEND
    ; I trust this pattern is obvious.
    if not gtsSRC'advMatch(piFNDBGN,piFNDEND,"</XXX>") then
      quitloop
    endif
    ; For each closing tag, enter "/" in the dynarray.
    ; This facilitates discriminating between opening and
    ; closing tags. Use the tag beginning location as the
    ; dynarray index for this entry too. (Indexing by the
    ; ending location would collide with the entry for an
    ; opening tag that immediately follows the closing tag.)
    pdsTAGS[format("w10,ez",piFNDBGN)] = "/"
  endwhile
  ; Scan the dynarray. The string variable 'psTAGTXT' will
  ; be blank until an opening tag is seen. As soon as that
  ; happens, begin incrementing 'piLEVEL' for each opening
  ; tag and decrementing it for each closing tag. When
  ; 'piLEVEL' becomes zero again, the matching closing
  ; tag has been found.
  psTAGTXT = blank()
  piLEVEL = 0
  foreach psBFFR in pdsTAGS
    if pdsTAGS[psBFFR]<>"/" then ; opening tag
      if psTAGTXT=blank() then
        piFNDBGN = longint(psBFFR)
        psTAGTXT = pdsTAGS[psBFFR]
      endif
      piLEVEL = piLEVEL+1
    else ; closing tag
      if psTAGTXT<>blank() then
        piLEVEL = piLEVEL-1
        if piLEVEL<=0 then
          ; The index is the location of the closing tag's
          ; first char. Add 6, the length of "</XXX>", to
          ; get the first char after the closing tag.
          piFNDEND = longint(psBFFR)+6
          quitloop
        endif
      endif
    endif
  endforeach
  if psTAGTXT=blank() then
    ; ...
    ; ... no opening tag was found
    ; ...
  else
    if piLEVEL<>0 then
      ; ...
      ; ... ERROR -- no matching closing tag was found
      ; ...
    else
      ; piFNDBGN =
      ;   file position of first char in opening tag
      ; piFNDEND =
      ;   file position of first char after closing tag
      ; psTAGTXT =
      ;   text of opening tag and attributes without '<' '>'
      ; ...
      ; ... process tag pair
      ; ...
    endif
  endif
  ; ...
endproc

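The list-then-scan approach of Listing 9 does not depend on ObjectPAL, so it can be exercised on small test strings elsewhere. The Python sketch below is an illustration under stated assumptions: `find_balanced_pair` is an invented name, a sorted list of tuples plays the role of the dynarray, positions are 0-based, and the patterns are assumed to behave as they do under advMatch().

```python
import re

OPEN_TAG = re.compile(r"<XXX([ \t\r\n]+[^>]*)?>")
CLOSE_TAG = re.compile(r"</XXX>")

def find_balanced_pair(text, start=0):
    """Return (begin, end, open_tag_text) of the first balanced pair, or None."""
    events = []
    # List every opening tag by location, storing its text without '<' and '>'.
    for m in OPEN_TAG.finditer(text, start):
        events.append((m.start(), m.end(), m.group(0)[1:-1]))
    # List every closing tag by location, marked with "/".
    for m in CLOSE_TAG.finditer(text, start):
        events.append((m.start(), m.end(), "/"))
    events.sort()                # tag start positions are unique, so this orders by location
    first_bgn = None
    tag_text = None
    level = 0
    for bgn, end, entry in events:
        if entry != "/":                     # opening tag
            if tag_text is None:
                first_bgn, tag_text = bgn, entry
            level += 1
        elif tag_text is not None:           # closing tag after the first opening tag
            level -= 1
            if level == 0:
                return first_bgn, end, tag_text
    return None                  # no opening tag, or no matching closing tag
```

Closing tags that appear before the first opening tag are ignored, as in Listing 9, and an unbalanced input returns None instead of a partial match.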
Next Section: Part 4: Replacing Parts and Part 5: Building Long Strings