Thanks everyone who offered guidance. I was finally able to overcome all the obstacles necessary to deal with this VERY cumbersome blob of data. Here is a synopsis of the factors:
The source provided a 7GB UTF-8 encoded text file. The file was structured as one record per line (136 fields), data delimited with Ascii(20) and qualified with Ascii(254). I eventually needed to get a field/record subset of this data into an ANSI ONLY legal database.
The first step is obtaining a Helper Class which could deal the with the beast of a large file. I was able to obtain this very nice "HugeBinaryFile.cls" from here: http://www.vbforums.com/showthread.php?t=531321
It uses the 64 bit VB 'Currency' datatype as the file pointer.
The following functions were also crucial to the 'byteArray to string' related actions (you could just as easily retained the Unicode if you needed to preserve any potential double byte character sets.)
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Public Function ByteArrayToString(bytArray() As Byte) As String
Dim i As Long
Dim sAns As String
sAns = StrConv(bytArray, vbUnicode)
i = InStr(sAns, Chr(0))
If i > 0 Then
ByteArrayToString = Left(sAns, i - 1)
Else
ByteArrayToString = sAns
End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
'Declare in your module
Public Declare Function MultiByteToWideChar Lib "kernel32" (ByVal CodePage As Long, ByVal dwFlags As Long, ByVal lpMultiByteStr As String, _
ByVal cchMultiByte As Long, ByVal lpWideCharStr As String, _
ByVal cchWideChar As Long) As Long
Public Const CP_UTF8 = 65001
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Function UTF8ToANSI(ByRef instring As String) As String
Dim i As Long
Dim L As Long
Dim temp As String
L = Len(instring)
temp = String(L * 2, 0)
i = MultiByteToWideChar(CP_UTF8, 0, instring & Chr(0), -1, temp, L)
If i > 0 Then
UTF8ToANSI = StrConv(Left(temp, (i - 1) * 2), vbFromUnicode)
i = InStr(UTF8ToANSI, Chr(0))
If i Then UTF8ToANSI = Left(UTF8ToANSI, i - 1)
Else
UTF8ToANSI = instring
End If
End Function
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
This main body of code reads buffer sampled "chunks" from the big source file. It ignores the first 3 bytes of UTF-8 BOM, It then searches for the end of record CRLF chars within the body of the sampled chunks; if the end of rec chars are not detected then the code doubles the size of the buffer and re-reads the bytes from the current record pointer position ~ in this way the buffer only gets as big as it needs to be. Currently, there is no protection to prevent the buffer from growing overly large, but as this is a one time deal it can be added in for other projects.
Any detected records are then converted to ANSI text and exported to a 1GB text file segement. The code will continue exporting out incremented 1GB ansi text files until the source file has been completely analyzed.
Of special note, I was not able to get the 'InstrB' to work as suggested directly with arrays. Further searching seemed to indicate that it would need to be converted to string first and then run (which defeats the purpose for me when I needed to maintain an accurate byte pointer of the source file where the CRLF's occur). Converting sampled byteArray to/from string changes the number of bytes. I ended up simply inspecting the sampled byte array within a For Each loop to detect the CRLF's to maintain relative file pointer info.
I have commented out the references to form Label objects which I used for my own feedback. I do have my own 'ParsingEngine module' to implement but that is outside the scope of this post. The resulting data can now be easily imported and auto-parsed into table(s) by something like MS Access 2010. From there querying the data is straight forward.
Once again - thanks everyone for your useful direction!
~Steve
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
Private Sub cmdUseBigFileClass_Click()
'== <Huge File Class> Related Variables =========
Dim hbfFile As HugeBinaryFile
Dim bytBuf() As Byte
Dim lngBufferSize As Long
lngBufferSize = 4096 'Starting Buffer Size 4KB
ReDim bytBuf(1 To lngBufferSize)
lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
Dim strExportPath As String 'Exported ANSI File Location
Dim intFragNum As Integer 'Exported ANSI File Fragment #
Dim intExportFileNum As Integer
Dim strTempUniString As String
Dim strTempAnsiString As String
Dim curABSNextRecPointer As Currency '( Rec Ptr in Abs Byte Position)
Dim lngRelEORPos As Long 'Relative End Of Rec Position(current chunk)
Dim strCurRecLine As String
Dim strHeaderRec As String
Dim curRecCtr As Currency
Dim vntTemp As Variant, vntLast As Variant, lngRelCtr As Long
intExportFileNum = FreeFile()
Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
Set hbfFile = New HugeBinaryFile
hbfFile.OpenFile strSourcePath 'Defined Elsewhere via file browse object
'lblBytes.Caption = Format(hbfFile.FileLen, "##,###,###,###,##0")
DoEvents
curABSNextRecPointer = 3 'Initialize @ First Byte beyond UTF-8 BOM (0~2)
Do Until (curABSNextRecPointer >= hbfFile.FileLen - 2) Or (hbfFile.EOF)
'Get Into Next Rec Pointer Position
hbfFile.SeekAbsolute curABSNextRecPointer
hbfFile.ReadBytes bytBuf 'Read the next Multi-K Byte sample
'Manually Inspect the byte Array for next [Chr(13)+Chr(10)] Rec Terminator
'This does job of "InstrB" directly with Byte Array
vntLast = 0
lngRelEORPos = 0
lngRelCtr = 0
For Each vntTemp In bytBuf ' Iterate through each element.
lngRelCtr = lngRelCtr + 1 'Used For Relative File Pointer Location
If vntTemp = 13 Then vntLast = vntTemp 'CR Found
If vntTemp = 10 Then 'Line Feed Found
If vntLast = 13 Then
'We found the Rec Terminator!
lngRelEORPos = lngRelCtr - 1 'Store Relative Position
Exit For
End If
End If
Next
If lngRelEORPos > 0 Then
'Next Record Resides at this ABS Byte position.
curABSNextRecPointer = curABSNextRecPointer + lngRelEORPos + 1
'Truncate the Byte Array to exclude the CRLF part
'of the current record data
ReDim Preserve bytBuf(1 To lngRelEORPos - 1)
strTempUniString = ByteArrayToString(bytBuf)
strTempAnsiString = UTF8ToANSI(strTempUniString)
ReDim bytBuf(1 To lngBufferSize) 'Return to set buffer size
strCurRecLine = strTempAnsiString
curRecCtr = curRecCtr + 1
If curRecCtr = 1 Then
'Store Header Record for all exported file 1GB fragments
strHeaderRec = strCurRecLine
Print #intExportFileNum, strHeaderRec
Else
'Export Current ANSI Record
Print #intExportFileNum, strCurRecLine
End If
If LOF(intExportFileNum) >= (2 ^ 30) Then
'The Currently Exported ANSI file is at 1GB, Close it
'and start a new file fragment.
Close intExportFileNum
intFragNum = intFragNum + 1
intExportFileNum = FreeFile()
Open strExportPath & "_" & CStr(intFragNum) & ".txt" For Output As intExportFileNum
Print #intExportFileNum, strHeaderRec
End If
Else
'Warning, End of Record NOT detected within current
'byte array sample size!
'Double the current buffer size & Re-Try record read
lngBufferSize = lngBufferSize * 2
ReDim bytBuf(1 To lngBufferSize) 'Destructive Allocation
'lblCurBufSize.Caption = Format(lngBufferSize, "###,###,###,###")
End If
'If curRecCtr Mod 100 = 0 Then
'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
'DoEvents
'End If
Loop
Close intExportFileNum
hbfFile.CloseFile
Set hbfFile = Nothing
'lblRecCount.Caption = Format(curRecCtr, "###,###,###,###")
'lblBytesRead.Caption = Format(curABSNextRecPointer, "##,###,###,###,##0")
End Sub