Discussion:
How to handle LARGE UTF-8 file
s***@yahoo.com
2012-03-08 16:05:04 UTC
Hi folks,

I recently had a large text file approaching 7GB in size dropped on me. The contents are supposed to be delimited text field data from a database. Its prohibitive size will not let me open it in a robust text editor, so I've just sampled the first 32K of it by opening it as a Binary file with 'Get & Put'. This at least allowed me to see what I was dealing with.

The entity who provided the data has disclaimed all responsibility for it, so I cannot simply ask for the data in another format.

The little 32K subset of text turned out to be UTF-8 encoded text with the EF BB BF header, comprising some 166 fields of delimited data. At least some subset of this data will eventually need to be loaded into an older legal database which only supports ANSI text.

I've tried loading the entire thing into an Office 2010 Access database, but because the text is UTF-8 encoded, Access seems to insist that it is loading an XML document and errors out during the load. My hope was to export only the fields we need in ANSI format.

Because UTF-8 is a variable-width encoding rather than double-byte Unicode all of the time (best I can tell from my research), I cannot simply step through the data and consistently ignore an 'extra' byte: 'A' stays one byte (&H41), a character like 254 takes two bytes, and something like the Euro sign takes three.

I experimented with 'StrConv' with no success for getting ANSI text out of sampled pieces of text.

My goal is to step thru this text file and export out some more manageable 2GB ANSI segments or some such approach.

Can anyone offer any suggestions on how I can achieve my goal?

TIA !

~Steve
Deanna Earley
2012-03-08 16:55:11 UTC
Post by s***@yahoo.com
Hi folks,
I recently had a large text file approaching 7GB in size dropped on
me. The contents are supposed to be delimited text field
data from a database. Its prohibitive size will not let me open it
in a robust text editor so I've just sampled the first 32K out of it
by opening it as a Binary file with 'Get & Put'. This at least
allowed me to see what I was dealing with.
The little 32K subset of text turned out to be Encoded UTF-8 text
with the EF BB BF header and is comprised of some 166 fields of
delimited data. At least some subset of this data will eventually
need to be loaded into an older legal database which only supports
ANSI text.
While the data may be in UTF-8 format, will it actually contain any
"non-ASCII" text?
UTF-8 and ASCII are identical for the first 128 code points.

You can check this by reading chunks (into a byte array) and scanning
for values > 127.
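
Untested, but something along these lines (against a sample slice of
the file - native VB binary I/O won't go past the 2GB mark):

Dim fnum As Integer, buf() As Byte, i As Long
Dim pos As Long, found As Boolean
fnum = FreeFile
Open "sample.txt" For Binary Access Read As #fnum
ReDim buf(1 To 65536) '64K chunks
pos = 1
Do While pos <= LOF(fnum) And Not found
    Get #fnum, pos, buf 'a short final read leaves old bytes in the
                        'tail of buf; harmless for this scan
    For i = 1 To UBound(buf)
        If buf(i) > 127 Then found = True: Exit For
    Next
    pos = pos + UBound(buf)
Loop
Close #fnum
Debug.Print IIf(found, "contains bytes > 127", "plain ASCII")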
--
Deanna Earley (***@icode.co.uk)
i-Catcher Development Team
http://www.icode.co.uk/icatcher/

iCode Systems

(Replies direct to my email address will be ignored.
Please reply to the group.)
Bob Butler
2012-03-08 18:13:29 UTC
Post by Deanna Earley
Post by s***@yahoo.com
Hi folks,
I recently had a large text file approaching 7GB in size dropped on
me. The contents are supposed to be delimited text field
data from a database. Its prohibitive size will not let me open it
in a robust text editor so I've just sampled the first 32K out of it
by opening it as a Binary file with 'Get & Put'. This at least
allowed me to see what I was dealing with.
The little 32K subset of text turned out to be Encoded UTF-8 text
with the EF BB BF header and is comprised of some 166 fields of
delimited data. At least some subset of this data will eventually
need to be loaded into an older legal database which only supports
ANSI text.
While the data may be in UTF-8 format, will it actually contain any
"non-ASCII" text?
UTF-8 and ASCII are identical for the first 128 code points.
You can check this by reading chunks (into a byte array) and scanning for
values > 127.
If it does have any special characters you should be able to leverage the
MultiByteToWideChar API call to convert from UTF8 to Unicode (its sibling
WideCharToMultiByte goes the other way, Unicode to ANSI) and then figure
out what to do with the special characters for inserting into the database.
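Off the top of my head (untested), the UTF8-to-Unicode half could look
like this; the Utf8BytesToString name is just for illustration:

Private Declare Function MultiByteToWideChar Lib "kernel32" ( _
    ByVal CodePage As Long, ByVal dwFlags As Long, _
    ByVal lpMultiByteStr As Long, ByVal cbMultiByte As Long, _
    ByVal lpWideCharStr As Long, ByVal cchWideChar As Long) As Long
Private Const CP_UTF8 As Long = 65001

Private Function Utf8BytesToString(b() As Byte) As String
    Dim n As Long, s As String, cb As Long
    cb = UBound(b) - LBound(b) + 1
    'first call measures, second call converts
    n = MultiByteToWideChar(CP_UTF8, 0, VarPtr(b(LBound(b))), cb, 0, 0)
    If n > 0 Then
        s = Space$(n)
        MultiByteToWideChar CP_UTF8, 0, VarPtr(b(LBound(b))), cb, _
            StrPtr(s), n
        Utf8BytesToString = s
    End If
End Function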
s***@yahoo.com
2012-03-08 18:49:26 UTC
-- Deanna,
While the data may be UTF-8 format, will it actually contain
any "non ascii" text? UTF-8 and ASCII are identical for the
first 128 code points.
Unfortunately this is part of the conundrum. This is a delimited text export where the source used ASCII(20) to delimit the fields and ASCII(254) to qualify the field content - and since 254 is above 127, every qualifier comes out multi-byte in UTF-8.
-- Bob,
If it does have any special characters you should be able to leverage
the MultiByteToWideChar API call to convert from UTF8 to Unicode (its
sibling WideCharToMultiByte goes the other way, Unicode to ANSI) and
then figure out what to do with the special characters for inserting
into the database.
Thanks all! I will look into using the 'MultiByteToWideChar' API call to see if I can parse this data into ANSI text.

~Steve
Farnsworth
2012-03-08 21:00:55 UTC
Besides what others suggested, check this link to see how the characters are
encoded:

http://en.wikipedia.org/wiki/Utf-8#Description

So ASCII 254(1111 1110) =

Byte 1: 110 00011 = &HC3
Byte 2: 10 111110 = &HBE
Farnsworth
2012-03-08 21:05:30 UTC
Post by Farnsworth
Besides what others suggested, check this link to see how the
characters are encoded:
http://en.wikipedia.org/wiki/Utf-8#Description
So ASCII 254(1111 1110) =
Byte 1: 110 00011 = &HC3
Byte 2: 10 111110 = &HBE
I made a mistake in the byte order, so it should be the other way around:

Byte 1: 110 11110 = &HDE
Byte 2: 10 000111 = &H87
s***@yahoo.com
2012-03-09 01:51:43 UTC
Farnsworth,

Your first reply's byte order actually seems to match my sample data.

ASCII(254)
UTF-8 Two Byte Representation: 1100 0011 1011 1110 &HC3BE
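
For what it's worth, the bit math checks out in the Immediate window:

Dim b1 As Byte, b2 As Byte
b1 = &HC0 Or (254 \ 64)   '110xxxxx = lead byte, top 2 bits of 254
b2 = &H80 Or (254 And 63) '10xxxxxx = continuation, low 6 bits
Debug.Print Hex$(b1); " "; Hex$(b2) 'prints: C3 BE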

I haven't digested the detailed UTF-8 Wiki explanation yet, and hopefully I won't have to unless I end up needing to write my own UTF-8 record decoder.

I am hoping to merely strip out the Byte Order Mark (BOM) &HEFBBBF, inspect for the end-of-record &H0D0A (one line = one record), and pass each record to the aforementioned API call.

Thanks,

~Steve
Post by Farnsworth
Post by Farnsworth
Besides what others suggested, check this link to see how the
http://en.wikipedia.org/wiki/Utf-8#Description
So ASCII 254(1111 1110) =
Byte 1: 110 00011 = &HC3
Byte 2: 10 111110 = &HBE
Byte 1: 110 11110 = &HDE
Byte 2: 10 000111 = &H87
Farnsworth
2012-03-09 04:32:24 UTC
Post by s***@yahoo.com
Farnsworth,
Your first reply's byte order actually seems to match my sample data.
ASCII(254)
UTF-8 Two Byte Representation: 1100 0011 1011 1110 &HC3BE
I haven't currently digested the detailed UTF-8 Wiki explanation yet
and I hopefully won't have to unless I end up needing to write my own
UTF-8 record decoder.
I am hoping to merely strip out the Byte Order Mark (BOM)
&HEFBBBF, inspect for the end-of-record &H0D0A (one line = one record),
and pass each record to the aforementioned API call.
If you look at the table in the Wiki article, you'll notice each of the
extra (continuation) bytes is always >= 128, so you can read a large
chunk, 1MB+, and check the last byte: if it is >= 128 you may have split
a multi-byte sequence and need to read a few extra bytes.
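
E.g. after reading a chunk into buf() (untested; assumes at least one
byte < 128 in the chunk, a given here since records end in CRLF):

Dim cut As Long
cut = UBound(buf)
'back up past any bytes that could belong to a split multi-byte
'sequence (lead and continuation bytes are all >= 128); this may
'also back over a complete final sequence, which is harmless since
'those bytes simply get re-read with the next chunk
Do While buf(cut) >= 128
    cut = cut - 1
Loop
'bytes LBound(buf)..cut are now safe to decode on their own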

As for CR LF, InStrB can be used for the byte search if you byte-copy the
array into a String first (s = arr keeps the bytes 1:1); note the pattern
must also be a byte string, since vbCrLf's own bytes are 0D 00 0A 00 and
would never match. Example:

Dim s As String
s = arr
Debug.Print InStrB(s, ChrB(13) & ChrB(10))

Finally, check ParseCSV01 routine at this page to parse the lines:

http://www.xbeat.net/vbspeed/c_ParseCSV.php
Schmidt
2012-03-09 06:32:45 UTC
I am hoping to merely strip out the Byte Order Mark (BOM) &HEFBBBF,
inspect for the end-of-record &H0D0A (one line = one record),
and pass each record to the aforementioned API call.
That's the right approach.

In a refined, speed-optimized version you could even read
entire record groups in 32kB chunks.

So what I'd do is: search for a helper class which can
read 7GB files (one that uses the Currency type for the
file-position pointer - there are a few of them floating
around on the web).

Read 32kB or 64kB byte-array chunks from the file
(skipping the UTF-8 BOM on the first chunk, of course).

The first action on a yet *undecoded* byte-array chunk
would be to loop backwards until you find the vbLf
character, to determine the end of the last "fully
contained" record within the current chunk.

Adapt your absolute 64-bit file-pointer variable to this
last record's vbLf position, so that you know from where
to read the next file chunk.

Shorten the current chunk's byte array per ReDim Preserve,
to exclude the trailing vbCr+vbLf marker you just found.

Then decode the entire (shortened) byte array from UTF-8
to a normal VB wide string (BSTR) using the
MultiByteToWideChar API.

Then do a normal VB Split on the decoded string,
using vbCrLf.

The resulting string array now contains properly decoded
16-bit-wide Unicode string records, which you can loop over
from index 0 to UBound(StrArray) to do your record processing.

Keep in mind that the VB strings in this array now contain
real Unicode - and *not* ANSI - e.g. the Euro sign (when present)
would show up in these strings as the 16-bit AscW value 8364 (&H20AC).

So I would deal with that fact inside your conversion routines -
and not attempt any additional ANSI conversion of these already
nicely converted 16-bit Unicode strings.
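
As a rough skeleton (untested; assumes the helper class exposes
FileLen/SeekAbsolute/ReadBytes or equivalents, and Utf8BytesToString
stands for a MultiByteToWideChar wrapper like the one sketched
earlier in the thread):

Dim buf() As Byte, recs() As String, s As String
Dim pos As Currency, i As Long, lastLF As Long

pos = 3 'skip the 3-byte UTF8-BOM
Do While pos < f.FileLen
    ReDim buf(0 To 65535) '64kB chunk, freshly zeroed each pass
    f.SeekAbsolute pos
    f.ReadBytes buf

    'backwards-loop: find the vbLf ending the last complete record
    For lastLF = UBound(buf) To 0 Step -1
        If buf(lastLF) = 10 Then Exit For
    Next
    If lastLF < 0 Then Exit Do 'no complete record left (file tail)

    pos = pos + lastLF + 1 'next chunk starts right after that vbLf
    ReDim Preserve buf(0 To lastLF - 2) 'drop the trailing vbCr+vbLf

    s = Utf8BytesToString(buf) 'UTF8 -> real 16-bit wide string
    recs = Split(s, vbCrLf)
    For i = 0 To UBound(recs)
        'recs(i) = one decoded Unicode record - process it here
    Next
Loop

(Edge cases - a file that doesn't end in CRLF, or a record longer
than one chunk - are left out; the shape is the point.)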

Olaf
Farnsworth
2012-03-09 18:40:38 UTC
Post by Schmidt
So what I'd do is, search for a helper-class which can
read 7GB-Files (which is using currency-Types for the
FilePositionPointer - there's some of them floating
around in the Web).
I forgot that VB won't go beyond 2 GB in files opened For Binary mode. One
such class I used in the past is CFile.cls in this VB6 source library:

http://sourceforge.net/projects/codebox/

It has most of the API functions for file I/O, and easy-to-use methods. It's
limited to 32-bit offsets though, but you can easily add the extra parameters
for the upper 32 bits.
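
From memory (untested), the upper half is just one more parameter on
the API side, and the Currency trick yields the two 32-bit halves:

Private Declare Function SetFilePointer Lib "kernel32" ( _
    ByVal hFile As Long, ByVal lDistanceToMove As Long, _
    lpDistanceToMoveHigh As Long, ByVal dwMoveMethod As Long) As Long
Private Declare Sub CopyMemory Lib "kernel32" Alias "RtlMoveMemory" ( _
    Destination As Any, Source As Any, ByVal Length As Long)
Private Const FILE_BEGIN As Long = 0

Private Sub SeekHuge(ByVal hFile As Long, ByVal bytePos As Currency)
    Dim scaled As Currency, halves(0 To 1) As Long
    scaled = bytePos / 10000@ 'Currency stores value*10000 as an Int64,
                              'so this leaves the raw byte offset inside
    CopyMemory halves(0), scaled, 8 '(0)=low 32 bits, (1)=high 32 bits
    SetFilePointer hFile, halves(0), halves(1), FILE_BEGIN
End Sub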
s***@yahoo.com
2012-03-14 15:54:05 UTC
Thanks everyone who offered guidance. I was finally able to overcome all the obstacles necessary to deal with this VERY cumbersome blob of data. Here is a synopsis of the factors:

The source provided a 7GB UTF-8 encoded text file. The file was structured as one record per line (136 fields), data delimited with Ascii(20) and qualified with Ascii(254). I eventually needed to get a field/record subset of this data into an ANSI ONLY legal database.

The first step was obtaining a helper class which could deal with this beast of a large file. I was able to obtain the very nice "HugeBinaryFile.cls" from here: http://www.vbforums.com/showthread.php?t=531321
It uses the 64 bit VB 'Currency' datatype as the file pointer.

The following functions were also crucial to the 'byte array to string' related actions (you could just as easily have retained the Unicode if you needed to preserve any potential double-byte character sets):
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Public Function ByteArrayToString(bytArray() As Byte) As String

    Dim i As Long
    Dim sAns As String

    'Widen the raw bytes one-per-character into a VB string,
    'then trim at the first embedded null (if any)
    sAns = StrConv(bytArray, vbUnicode)
    i = InStr(sAns, Chr$(0))
    If i > 0 Then
        ByteArrayToString = Left$(sAns, i - 1)
    Else
        ByteArrayToString = sAns
    End If

End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
'Declare in your module
Public Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As String, _
ByVal cchMultiByte As Long, ByVal lpWideCharStr As String, _
ByVal cchWideChar As Long) As Long

Public Const CP_UTF8 = 65001
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Function UTF8ToANSI(ByRef instring As String) As String
    Dim i As Long
    Dim L As Long
    Dim temp As String

    'instring carries one UTF-8 byte per character (see ByteArrayToString);
    'the "ByVal ... As String" declare converts it back to raw bytes at the
    'API boundary, which is exactly what the API wants to see.
    L = Len(instring)
    temp = String$(L * 2, 0) 'ample: UTF-8 never decodes to more chars than bytes
    'cchMultiByte = -1: convert up to the null terminator; the returned
    'count i then includes that terminating null, hence (i - 1) below.
    i = MultiByteToWideChar(CP_UTF8, 0, instring & Chr$(0), -1, temp, L)
    If i > 0 Then
        'temp comes back with the result spread one byte per character;
        'StrConv packs those bytes back into a normal VB string.
        UTF8ToANSI = StrConv(Left$(temp, (i - 1) * 2), vbFromUnicode)
        i = InStr(UTF8ToANSI, Chr$(0))
        If i Then UTF8ToANSI = Left$(UTF8ToANSI, i - 1)
    Else
        UTF8ToANSI = instring 'conversion failed: pass through unchanged
    End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

This main body of code reads buffer-sampled "chunks" from the big source file. It skips the first 3 bytes of UTF-8 BOM, then searches for the end-of-record CRLF chars within the body of the sampled chunk; if the end-of-rec chars are not detected, the code doubles the size of the buffer and re-reads the bytes from the current record pointer position - in this way the buffer only gets as big as it needs to be. Currently there is no protection to prevent the buffer from growing overly large, but as this is a one-time deal that can be added in for other projects.

Any detected records are then converted to ANSI text and exported to a 1GB text file segment. The code will continue exporting incremented 1GB ANSI text file segments until the source file has been completely analyzed.

Of special note, I was not able to get 'InStrB' to work as suggested directly with byte arrays; it wants String operands, and the search pattern must itself be a byte string (vbCrLf is a Unicode string - bytes 0D 00 0A 00 - so it never matches the raw 0D 0A pair). A raw s = arr byte-copy preserves the byte count (it's the StrConv-style conversions that change it), but in the end I simply inspected the sampled byte array within a For Each loop to detect the CRLFs, which made it easy to maintain an accurate relative file pointer to where they occur.
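
For reference, a quick demo of the byte-string idiom (the StrConv call
here just manufactures a few ANSI bytes to search):

Dim arr() As Byte, s As String, p As Long
arr = StrConv("field1" & vbCrLf & "field2", vbFromUnicode)
s = arr 'raw byte-copy: byte count is preserved
p = InStrB(1, s, ChrB(13) & ChrB(10)) '1-based BYTE offset of the CR
Debug.Print p 'prints: 7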

I have commented out the references to form Label objects which I used for my own feedback. I do have my own 'ParsingEngine module' to implement, but that is outside the scope of this post. The resulting data can now be easily imported and auto-parsed into table(s) by something like MS Access 2010. From there, querying the data is straightforward.

Once again - thanks everyone for your useful direction!

~Steve
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

Private Sub cmdUseBigFileClass_Click()

    '== <Huge File Class> Related Variables =========
    Dim hbfFile As HugeBinaryFile
    Dim bytBuf() As Byte
    Dim lngBufferSize As Long

    lngBufferSize = 4096 'Starting Buffer Size 4KB
    ReDim bytBuf(1 To lngBufferSize)
    'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")

    Dim strExportPath As String 'Exported ANSI File Location
    Dim intFragNum As Integer 'Exported ANSI File Fragment #
    Dim intExportFileNum As Integer
    Dim strTempUniString As String
    Dim strTempAnsiString As String
    Dim curABSNextRecPointer As Currency '(Rec Ptr in Abs Byte Position)
    Dim lngRelEORPos As Long 'Relative End Of Rec Position (current chunk)
    Dim strCurRecLine As String
    Dim strHeaderRec As String
    Dim curRecCtr As Currency
    Dim vntTemp As Variant, vntLast As Variant, lngRelCtr As Long

    strExportPath = strSourcePath & "_ANSI" 'Base path for output fragments (adjust as needed)
    intExportFileNum = FreeFile()
    Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum

    Set hbfFile = New HugeBinaryFile
    hbfFile.OpenFile strSourcePath 'Defined Elsewhere via file browse object
    'lblBytes.Caption = Format(hbfFile.FileLen, "##,###,###,###,##0")
    DoEvents

    curABSNextRecPointer = 3 'Initialize @ First Byte beyond UTF-8 BOM (0~2)

    Do Until (curABSNextRecPointer >= hbfFile.FileLen - 2) Or (hbfFile.EOF)

        'Get Into Next Rec Pointer Position
        hbfFile.SeekAbsolute curABSNextRecPointer
        hbfFile.ReadBytes bytBuf 'Read the next Multi-K Byte sample

        'Manually Inspect the byte Array for the next [Chr(13)+Chr(10)] Rec Terminator
        'This does the job of "InStrB" directly on the Byte Array
        vntLast = 0
        lngRelEORPos = 0
        lngRelCtr = 0
        For Each vntTemp In bytBuf 'Iterate through each element.
            lngRelCtr = lngRelCtr + 1 'Used For Relative File Pointer Location
            If vntTemp = 10 And vntLast = 13 Then 'LF immediately preceded by CR
                'We found the Rec Terminator!
                lngRelEORPos = lngRelCtr - 1 'Store Relative Position of the CR
                Exit For
            End If
            vntLast = vntTemp 'Remember the previous byte (CR+LF must be adjacent)
        Next

        If lngRelEORPos > 0 Then

            'Next Record Resides at this ABS Byte position.
            curABSNextRecPointer = curABSNextRecPointer + lngRelEORPos + 1

            'Truncate the Byte Array to exclude the CRLF part of the
            'current record data (assumes no empty records, i.e.
            'lngRelEORPos > 1)
            ReDim Preserve bytBuf(1 To lngRelEORPos - 1)

            strTempUniString = ByteArrayToString(bytBuf)
            strTempAnsiString = UTF8ToANSI(strTempUniString)
            ReDim bytBuf(1 To lngBufferSize) 'Return to set buffer size
            strCurRecLine = strTempAnsiString
            curRecCtr = curRecCtr + 1
            If curRecCtr = 1 Then
                'Store Header Record for all exported 1GB file fragments
                strHeaderRec = strCurRecLine
                Print #intExportFileNum, strHeaderRec
            Else
                'Export Current ANSI Record
                Print #intExportFileNum, strCurRecLine
            End If
            If LOF(intExportFileNum) >= (2 ^ 30) Then
                'The Currently Exported ANSI file is at 1GB, Close it
                'and start a new file fragment.
                Close intExportFileNum
                intFragNum = intFragNum + 1
                intExportFileNum = FreeFile()
                Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
                Print #intExportFileNum, strHeaderRec
            End If
        Else
            'Warning, End of Record NOT detected within current
            'byte array sample size!

            'Double the current buffer size & Re-Try record read
            lngBufferSize = lngBufferSize * 2
            ReDim bytBuf(1 To lngBufferSize) 'Destructive Allocation
            'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
        End If

        'If curRecCtr Mod 100 = 0 Then
        'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
        'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
        'DoEvents
        'End If

    Loop

    Close intExportFileNum
    hbfFile.CloseFile
    Set hbfFile = Nothing

    'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
    'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")

End Sub
