SysChat


DominicD 02-28-2014 05:31 AM

How to split large Code, Text and Database files with GSplit
 

Are you a web developer working daily with hundreds or thousands of lines of code? Are you a blogger who needs to back up, import, and export your WordPress databases? Are you an analyst who occasionally works with hundreds of megabytes of data exports and databases? Are you a database administrator who needs to manage ever-growing database records?

Ever had the nightmare of data files growing so big that export, import, and even recovery become impossible because of file size and application limits!?

http://i57.tinypic.com/zx7iug.gif
GSplit is a powerful and free file splitter that lets you split your large files into a set of smaller files called pieces. It also creates a Self-Uniting program that automatically restores the original file with no requirement. GSplit includes a lot of customization features for easily and safely splitting your files.

I would like to share the tutorial below because I personally ran into a critical need to split a large text file of 44 million database records and restore it! Of course, on top of this task comes the complication of choking the production database, since having it retrieve all 44 million records gobbles up the CPU and memory!! Wait a few more minutes while it imports the large database dump of 44 million records, and it crashes the entire database!

I will share with you the tutorials that I have found to be helpful in splitting text/csv files with GSplit.

PROBLEM:
A large text file of 2.4 gigabytes needs to be loaded into the SQL database. Importing this file into the database takes a very long time (at least 40 minutes of plain waiting!). To make matters worse, I get repeated, unexplained errors once SQL has attempted to copy the 34-millionth record! SQL just crashes at the 34 million mark and aborts the copy.

This crash happens because typical programs like Notepad, Microsoft Excel, Microsoft Access, and even SQL Server attempt to take the 2.4-gigabyte text file and load it into RAM all at once, and from there copy the contents to the actual database or display them on screen. While this behavior works fine for most files, a file this big chokes the already running SQL database and competes with it for RAM. A file this large also hits the limits of Office applications like Excel and Access.

OBSERVATIONS:
Since the copying process always ends at the 34-millionth record, I concluded that there must be invalid characters or data corruption somewhere in the 34-millionth record, and possibly onwards from that point.

Now data corruption is immediately a bad topic, and as with anyone working with databases, we prefer all our records intact!

Since the text file is so huge, Notepad could not open it. Microsoft Excel cannot load the file because it exceeds Excel's row limit! Even Microsoft Access could not load the file, as the row count and raw data contents exceed the maximum file size that Access can load.

Even the developer friendly Notepad++ software could not load the data.
Surely, loading the plain text file and inspecting the problem with the 34-millionth row would seem like the easiest solution!


STRATEGY:
Load the text file in batches -- splitting the text file around the suspected data corruption or invalid data entry at the 34-millionth record!
And this is where GSplit comes to the rescue!

Unfortunately, with MS Excel, MS Access, and even enterprise tools such as Microsoft SQL Server, loading a large database dump through the usual import screens is not very customizable -- as far as I could tell, you can only load the whole text file / csv file / sql script file, OR not load it at all! I found no obvious options to load only the first few records, load only the last few records, or specify which records to load!
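
If all you need is a quick look at the rows around a suspect record, a short script can stream just that range without loading the whole file. The following is a minimal Python sketch and not part of GSplit; the function name read_records and the file name dump.txt are hypothetical.

```python
from itertools import islice


def read_records(path, start, count):
    """Return `count` lines starting at 1-based line number `start`."""
    # Binary mode avoids decode errors if the suspect record contains
    # invalid characters. The file is streamed line by line, so memory
    # use stays small no matter how large the file is.
    with open(path, "rb") as f:
        return list(islice(f, start - 1, start - 1 + count))


# Hypothetical usage: look at five rows around the 34-millionth record.
# for raw in read_records("dump.txt", 34_000_000 - 2, 5):
#     print(raw)
```

Reading still has to pass over every line before the requested range, so this is slow on a 2.4 GB file, but it does not compete with the database for RAM.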


THE SOLUTION: Split the large file with GSplit!
The solution I found is to split the large dump of text file with GSplit.


1. Download GSplit from the GSplit website (GSplit - File Splitter - Split Any File - Split Text and Log Files)
http://i59.tinypic.com/14o94b7.png
I chose to download the portable .ZIP file. This way, I can keep the GSplit program files even on USB and network drives.


2. Run GSplit, and locate the file that you want to split
http://i60.tinypic.com/bey8ed.jpg
Load the file that you want to split. Click on the Original File link and choose the file to split.


3. Select where you want to save the split files. Click Destination Folder
http://i61.tinypic.com/312aaea.jpg


4. Specify the splitting options
http://i57.tinypic.com/juvnew.jpg
Select Blocked Pieces
- this makes GSplit build multiple pieces of the file according to the specific size and type that we want.

Selecting Spanned Pieces will split the file across storage media such as USB flash drives, CDs, or DVDs. In this scenario, this is not the split that we need.

We choose Blocked Pieces because we want to place the split right after the line where the 34-millionth record may be corrupted.


For the Blocked Pieces splitting, we choose the option "I want to split after the nth occurrence of a specified pattern"
http://i60.tinypic.com/e7fkgi.png

Since we know the approximate line number (record number) of the suspected data corruption, we enter that line number as the n in "nth occurrence", so that the split falls at that point in the file.

We specify the pattern as 0x0D0x0A
This pattern is the hexadecimal code for a line break or "next line" (carriage return followed by line feed)

In more detail (derived from the GSplit documentation):

A pattern can contain alphanumeric characters only. To specify other characters not normally permitted, you must use the "0x" command followed by the 2-digit hexadecimal sequence that refers to the corresponding ASCII character code you want. Example: 0x40 denotes the @ character. You can use characters from 0x00 to 0xFF ; see the ASCII table here. The 2 digits must be specified: 0xA is incorrect, use 0x0A. Unicode sequences are not supported.

Patterns may have up to 256 characters.

 Examples of patterns:
  • newline characters (see http://en.wikipedia.org/wiki/Newline for more information):
      - LF (Line Feed, 0x0A), CR (Carriage Return, 0x0D), or CR followed by LF (CR+LF, 0x0D0x0A)
      - For Windows, use this pattern: 0x0D0x0A.

  • data separator characters: | (0x7C) ; (0x3B)...

  • others: my@separator is written as my0x40separator

If you are splitting a log file, a CSV file or any text file with a header, you may also want to insert a header in each piece file.
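
To make the "0x" notation concrete, here is a small Python sketch that turns a pattern written in that notation into the raw bytes it stands for. This is only my reading of the rules quoted above, not GSplit's actual code; the function name decode_pattern is hypothetical, and the sketch assumes every "0x" followed by two hex digits is an escape.

```python
import re

# "0x" followed by exactly two hexadecimal digits, as the quoted rules require.
_HEX_ESCAPE = re.compile(r"0x([0-9A-Fa-f]{2})")


def decode_pattern(pattern: str) -> bytes:
    """Convert a pattern such as '0x0D0x0A' into the bytes it represents."""
    # Replace each 0xNN escape with the character that has that code.
    text = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), pattern)
    # Latin-1 maps codes 0x00..0xFF to single bytes one to one.
    return text.encode("latin-1")


print(decode_pattern("0x0D0x0A"))         # CR+LF, the Windows line break
print(decode_pattern("my0x40separator"))  # 0x40 is the @ character
```

A malformed escape such as 0xA is simply left as literal text by this sketch, which is a simplification on my part.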

5. Click on the File Names link
http://i59.tinypic.com/1z31m4w.jpg
In the “piece name mask” field, enter “{ofw}_{num}{ore}” (without quotes). This will generate pieces that keep the original file name and extension, with names that look like “filename_1.csv”. Alternatively, you could leave this field alone, go with the default generated names, and simply rename the extension to “.txt” (or whatever your original extension is) in Windows Explorer.

Follow the sample syntax noted on GSplit for more details.
This option is very powerful in terms of automatically setting the prefix filenames of the many split files that will be generated.


6. Specify the most important splitting parameter!
Add a check mark to "Do not add GSplit Tags to split files"
http://i62.tinypic.com/16j1lyo.jpg
Since our aim is to split a large text dump acquired from a database, we do not want the split files to contain any other data that may compromise the integrity of their contents. Put a check mark on this option so that the split files are clean of any additional GSplit meta-data.

If you are splitting a file for storage purposes, it is important that you keep the GSplit tags (leave this option unchecked). This ensures that GSplit can re-combine the files when you need them back as one single piece.


7. Now we split the large text dump into multiple parts according to our specified pattern
http://i60.tinypic.com/i3yo11.jpg
Click Split


8. Once the splitting process is finished you will see the “Splitting log” screen. Click on the “Open the folder in Windows Explorer” link to instantly jump to the output folder.
http://i59.tinypic.com/2gy68gy.jpg


9. If you are going to repeat this splitting process in the future, you might consider saving these settings as a “profile” that can be loaded when you need it, so that you do not have to go through this process again. To do so, select “Save a Profile As” from the “File” menu.
http://i59.tinypic.com/34g0dv8.jpg
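
For anyone who wants to script a similar split, the steps above can be roughly approximated in Python. This is a sketch under my own assumptions, not GSplit's implementation: the function name split_after_lines is hypothetical, the piece names only imitate the “{ofw}_{num}{ore}” mask, and it handles LF and CR+LF line breaks only.

```python
from pathlib import Path


def split_after_lines(src, dest_dir, lines_per_piece, newline=b"\r\n"):
    """Split `src` into pieces that each end after `lines_per_piece` line breaks.

    Returns the list of piece paths. No extra bytes (tags, headers) are
    written, so the pieces concatenated in order equal the original file.
    """
    src = Path(src)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    pieces = []
    out = None
    seen = 0
    with src.open("rb") as f:
        for line in f:  # streams one line at a time, never the whole file
            if out is None:
                # Hypothetical naming: original name, piece number, extension.
                piece = dest_dir / f"{src.stem}_{len(pieces) + 1}{src.suffix}"
                pieces.append(piece)
                out = piece.open("wb")
                seen = 0
            out.write(line)
            if line.endswith(newline):
                seen += 1
            if seen == lines_per_piece:
                out.close()
                out = None
    if out is not None:
        out.close()
    return pieces
```

With lines_per_piece=1_000_000, for example, the 34-millionth line would be the last line of piece 34, which narrows the inspection down to one file small enough for a text editor.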


======================================================

Where else can you use GSplit?

CSV, SQL, TXT Databases and Large Data Dumps -- Import a huge data dump into a development database chunk by chunk. This should help you troubleshoot and inspect possible data corruption. It also helps if you simply want to view the exported database contents in a text editor like Notepad or Notepad++.

Importing data in smaller chunks also puts less of a burden on live production and development databases. Import the data in batches, one for each of the split files that you make with GSplit.
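
As an illustration of batch importing, here is a minimal Python sketch that loads CSV pieces one at a time and reports which piece fails. It uses SQLite only as a stand-in (my own problem involved SQL Server), and the function name import_pieces, the table name, and the column count are all hypothetical.

```python
import csv
import sqlite3


def import_pieces(conn, piece_paths, table, n_cols):
    """Load CSV pieces one at a time; return (path, error) for failed pieces."""
    placeholders = ", ".join("?" * n_cols)
    # `table` is assumed to be a trusted, hypothetical table name.
    sql = f"INSERT INTO {table} VALUES ({placeholders})"
    failed = []
    for path in piece_paths:
        try:
            with open(path, newline="", encoding="utf-8") as f:
                with conn:  # one transaction per piece: commit or roll back
                    conn.executemany(sql, csv.reader(f))
        except (sqlite3.Error, UnicodeDecodeError, csv.Error) as exc:
            # A bad piece is rolled back and recorded; the others still load.
            failed.append((path, str(exc)))
    return failed
```

Because each piece is its own transaction, a corrupted record only blocks the piece that contains it, and the error message points at the file to inspect.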

Log files -- System and server log files grow huge and complex over time. Use GSplit to manage log file sizes and/or to extract the relevant data that you need to investigate. With GSplit, if you know the keywords in the log file that you need, you can split at those lines and automatically generate smaller log files on the specific topic of your investigation.

Programmer Source Code -- If you need to export portions of code that are used repeatedly within the code base, or you need to narrow your inspection to just some sections of the source code, simply enter a piece of your source code as the "pattern" in GSplit and begin to cut and zoom into your source code.

General File Splitting and storage AND/OR Spanning across multiple storage devices -- Send your files via email, instant messaging, and online hosting services without size restriction. Share files over the Internet more easily and avoid large downloads which may become truncated. Transfer any file from one computer to another using floppies or any other removable storage device. Back up large files on several CDs, DVDs, or USB sticks.





