Splitting a very large file




Postby Aaron Chessell » Fri Aug 01, 2008 3:49 am

Hi,

We have a file of approximately 30 to 40 million records with an LRECL of 5493. The number of records can vary from run to run.

I need to split this file into smaller files of 1 million records each. The source file is on cartridge and the smaller files will also be going to cartridge. The source file is already sorted in the order that we want, so no sorting needs to be done.

I also need to add a header and footer to each file that is generated from splitting the file into smaller chunks.

The header will need to contain the following:

An identifier of ten 0's, then a blank, then a timestamp in the format YYYY-MM-DD-HH.MM.SS.TTTTTT (this timestamp basically needs to be the time that the job started and needs to be the same across all files), then the text 'ATO Temporary Resident Super File', then the number of the file (PIC 9(2)).

So the header for the fifth file would look like this:

0000000000 2008-08-01-15.30.32.123456ATO Temporary Resident Super File 05

The footer will need to contain the following:

An identifier of ten 9's, then the number of records in the file (PIC 9(7)), then a count of the number of records in all the files up to and including this file (PIC 9(8)).

So the footer for the fifth file would look like this:

9999999999100000005000000
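For illustration only, the header and footer layouts above can be sketched as plain string formatting (Python standing in here just to show the byte layout; the helper names are hypothetical, the field widths mirror the PIC clauses, and the timestamp is the example value from the spec):

```python
# Sketch of the header/footer record layouts described above.
# Hypothetical helper names; widths follow the PIC clauses in the spec.

def build_header(timestamp: str, file_number: int) -> str:
    # ten 0's, a blank, the job-start timestamp, the literal text,
    # then the file number as PIC 9(2)
    return f"{'0' * 10} {timestamp}ATO Temporary Resident Super File {file_number:02d}"

def build_footer(records_in_file: int, cumulative_records: int) -> str:
    # ten 9's, record count as PIC 9(7), running total as PIC 9(8)
    return f"{'9' * 10}{records_in_file:07d}{cumulative_records:08d}"

header = build_header("2008-08-01-15.30.32.123456", 5)
footer = build_footer(1_000_000, 5_000_000)
```

Reproducing the fifth file's example records exactly as shown above confirms the field widths add up (header is 10 + 1 + 26 + 33 + 3 = 73 bytes, footer is 10 + 7 + 8 = 25 bytes).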

Lastly, how much temporary work space does ICETOOL need to do this? I know it is a lot of data and I am not sure how much to allocate to help with efficiency.

Any help very much appreciated.

Aaron

Re: Splitting a very large file

Postby Frank Yaeger » Fri Aug 01, 2008 4:02 am

Aaron Chessell wrote:
The source file is already sorted in the order that we want, so no sorting needs to be done.

How much temporary work space does ICETOOL need to do this? I know it is a lot of data and am not sure how much to allocate to help with efficiency.


No temporary work space will be needed since you'll be doing a copy, not a sort. Work space is only needed for a sort, not for a copy or merge.

Aaron Chessell wrote:
We have a file of approximately 30 to 40 million records with an LRECL of 5493. The number of records can vary from run to run.

I need to split this file into smaller files of 1 million records each. The source file is on cartridge and the smaller files will also be going to cartridge.


Are you looking to do this in one pass with 40 output DD statements (OUTFIL)? Or are you looking to do this in one pass with x output DD statements where x is determined dynamically from the number of input records? Or are you looking to use only one output DD statement per pass with 40 passes or x passes? You need to define the constraints and rules for what you're trying to do in more detail.
Frank Yaeger - DFSORT Development Team (IBM) - yaeger@us.ibm.com
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
=> DFSORT/MVS is on the Web at http://www.ibm.com/storage/dfsort

Re: Splitting a very large file

Postby Aaron Chessell » Fri Aug 01, 2008 4:52 am

Hi Frank,

Sorry for my ignorance, but I'm not sure how we want to do it. Being new to ICETOOL, I am not sure if it is more efficient to do it all in one pass or multiple passes, or whether that has any impact at all. I will need your direction and advice on that.

We would like to determine the number of output DD statements dynamically from the number of input records though.

Aaron

Re: Splitting a very large file

Postby Frank Yaeger » Fri Aug 01, 2008 9:36 pm

I wasn't asking about how you want to do it in terms of efficiency - one pass would be more efficient than multiple passes. I was asking in terms of "cartridge handling". Unless you have 40 drives for the cartridges, you won't be able to handle 40 of them at the same time, so there would be mounts and dismounts involved which you would probably want to avoid (or maybe not).

Aaron Chessell wrote:
We would like to determine the number of output DD statements dynamically from the number of input records though.


That would involve creating the DD statements (and possibly even entire job steps) dynamically as part of a job to be submitted to the internal reader. What you're asking for is quite involved and tricky, but Kolusu volunteered to take a look at it so he'll get back to you.
Frank Yaeger - DFSORT Development Team (IBM) - yaeger@us.ibm.com
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
=> DFSORT/MVS is on the Web at http://www.ibm.com/storage/dfsort

Re: Splitting a very large file

Postby skolusu » Sat Aug 02, 2008 4:11 am

Aaron Chessell ,

The following DFSORT/ICETOOL JCL will give you the desired results. A brief explanation of the job:

1. The first COPY operator gets the count of the total number of records in the input file.
2. The second COPY operator uses this count to calculate the number of steps needed.
3. The third COPY operator creates records using the REPEAT function.
4. The last COPY operator is the one which creates the dynamic JCL.

Look at the output from the DD name OUT, which will contain the generated JCL. Once you change the file names and your TID, the job can be submitted via INTRDR, which actually does the split of the files.

Once you have verified that the generated JCL is correct, change

//OUT      DD SYSOUT=*

to

//OUT      DD SYSOUT=(*,INTRDR)



//STEP0100 EXEC PGM=ICETOOL
//TOOLMSG  DD SYSOUT=*     
//DFSMSG   DD SYSOUT=*     
//IN       DD DSN=your input tape file,
//            DISP=SHR
//T1       DD DSN=&&T1,DISP=(,PASS),SPACE=(CYL,(1,1))         
//T2       DD DSN=&&T2,DISP=(,PASS),SPACE=(CYL,(1,1))         
//OUT      DD SYSOUT=*                                         
//TOOLIN   DD *                                               
  COPY FROM(IN) USING(CTL1)                                   
  COPY FROM(T1) USING(CTL2)                                   
  COPY FROM(T1) USING(CTL3)                                   
  COPY FROM(T2) USING(CTL4)                                   
/*                                                             
//CTL1CNTL DD *                                               
  OUTFIL FNAMES=T1,REMOVECC,NODETAIL,                         
  TRAILER1=(COUNT=(M11,LENGTH=15))                             
/*                                                             
//CTL2CNTL DD *                                               
  INREC OVERLAY=(18:1,15,ZD,DIV,+1,EDIT=(TT))                 
  OUTFIL FNAMES=CTL3CNTL,                                     
  BUILD=(C' OPTION COPY ',/,                                   
    C' OUTFIL FNAMES=T2,REPEAT=',18,2,C',',/,                 
    C' BUILD=(SEQNUM,2,ZD,9C',                                 
    C'''',C'0',C'''',C',',/,C' C''',DATE4,C'.000000''',C',',/,
    C' C''',C'ATO TEMPORARY RESIDENT SUPER FILE',             
    C'''',C',X,SEQNUM,2,ZD,',/,                               
    C' C''',8,8,C'''',C')',80:X)                               
/*                                                             
//CTL3CNTL DD DSN=&&C1,DISP=(,PASS),SPACE=(CYL,(1,1))         
/*                                                             
//CTL4CNTL DD *                                                   
  OUTFIL FNAMES=OUT,                                             
  REMOVECC,IFOUTLEN=80,                                           
  HEADER1=('//TIDXXXXA JOB ','''','COPY''',/,                     
             '//             CLASS=A,',/,                         
             '//             MSGCLASS=Y,',/,                     
             '//             MSGLEVEL=(1,1),',/,                 
             '//             NOTIFY=&SYSUID',/,                   
             '//*'),                                             
  IFTHEN=(WHEN=(1,2,ZD,EQ,1),                                     
     BUILD=(C'//STEP0',SEQNUM,3,ZD,C' EXEC PGM=ICEMAN',/,         
            C'//SYSOUT   DD SYSOUT=*          ',/,               
            C'//SORTIN   DD DISP=SHR,DSN=YOUR INPUT TAPE',/,     
            C'//SORTOUT  DD DSN=YOUR OUTPUT FILE',SEQNUM,2,ZD,/, 
            C'//            DISP=(NEW,CATLG,DELETE),',/,         
            C'//            UNIT=TAPE,',/,                       
            C'//            VOL=(,,99)',/,                       
            C'//SYSIN    DD *',/,                                 
            C'  OPTION COPY',/,                                   
            C'  OUTFIL REMOVECC,STARTREC=00000001',               
            C',ENDREC=',+1000000,MUL,1,2,ZD,M11,LENGTH=8,C',',/, 
            C'  HEADER1=(',C'''',3,35,C'''',C',',/,               
            2X,C'''',38,36,C'''',C'),',/,                         
            C'  TRAILER1=(',C'''',C'000000000',C'''',C',',C'''', 
            +1000000,MUL,1,2,ZD,M11,LENGTH=8,C'''',C')',/,       
            C'//*',80:X)),                                       
  IFTHEN=(WHEN=NONE,                                             
     BUILD=(C'//STEP0',SEQNUM,3,ZD,START=2,INCR=1,               
            C' EXEC PGM=ICEMAN',/,                               
            C'//SYSOUT   DD SYSOUT=*          ',/,               
            C'//SORTIN   DD DISP=SHR,DSN=YOUR INPUT TAPE',/,     
            C'//SORTOUT  DD DSN=YOUR OUTPUT FILE',SEQNUM,2,ZD,   
            START=2,INCR=1,C',',/,                               
            C'//            DISP=(NEW,CATLG,DELETE),',/,         
            C'//            UNIT=TAPE,',/,                       
            C'//            VOL=(,,99)',/,                       
            C'//SYSIN    DD *',/,                                 
            C'  OPTION COPY',/,                                   
            C'  OUTFIL REMOVECC,STARTREC=',                       
            +1,ADD,((1,2,ZD,SUB,+1),MUL,+1000000),M11,LENGTH=8,   
            C',ENDREC=',+1000000,MUL,1,2,ZD,M11,LENGTH=8,C',',/, 
            C'  HEADER1=(',C'''',3,35,C'''',C',',/,               
            2X,C'''',38,36,C'''',C'),',/,                         
            C'  TRAILER1=(',C'''',C'000000000',C'''',C',',C'''', 
            +1000000,MUL,1,2,ZD,M11,LENGTH=8,C'''',C')',80:X)),   
  TRAILER1=('  OUTFIL FNAMES=REST,SAVE,',/,                       
            '  HEADER1=(',C'''',3,35,C'''',C',',/,               
            2X,C'''',38,34,C'''',C'),',/,                         
            C'  TRAILER1=(',C'''',C'000000000',C'''',C',',C'''', 
            74,8,C'''',C')',/,                                   
            '//REST     DD DSN=YOUR LAST RECORDS FILE,',/,       
            C'//            DISP=(NEW,CATLG,DELETE),',/,         
            C'//            UNIT=TAPE,',/,                       
            C'//            VOL=(,,99)',/,                       
            C'//*',80:X)                                         
/*                                                               
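As a cross-check on the arithmetic behind the generated JCL (a sketch only, not part of the DFSORT solution): each full step n copies records (n-1)*1,000,000+1 through n*1,000,000, and any leftover records go to the REST (SAVE) data set. In Python:

```python
# Sketch of the STARTREC/ENDREC ranges the generated JCL produces,
# plus the leftover record count handled by the REST (SAVE) data set.

CHUNK = 1_000_000

def split_ranges(total_records: int, chunk: int = CHUNK):
    full_chunks = total_records // chunk
    # step n copies records (n-1)*chunk+1 through n*chunk
    ranges = [((n - 1) * chunk + 1, n * chunk) for n in range(1, full_chunks + 1)]
    leftover = total_records - full_chunks * chunk  # goes to REST
    return ranges, leftover

ranges, leftover = split_ranges(3_500_000)
# three full files of a million records each, 500,000 records to REST
```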
Kolusu - DFSORT Development Team (IBM)
DFSORT is on the Web at:
www.ibm.com/storage/dfsort

Re: Splitting a very large file

Postby dick scherrer » Sat Aug 02, 2008 5:39 am

Hello,

If there are 37 million total records, how many times will the original input be read from the beginning? 38? First the count, then 1 million, then 2 million, then 3 million, and so on?

I believe this would translate into over 700 million records read to get through all of the records.
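The rough arithmetic behind that estimate (assuming 37 full one-million-record chunks, a separate counting pass, and that each copy pass reads only up to the end of its own chunk):

```python
# Records read: one counting pass over all 37 million records, then
# pass n reads n million records (it stops at the end of chunk n).
MILLION = 1_000_000

count_pass = 37 * MILLION
copy_passes = sum(n * MILLION for n in range(1, 38))  # 1M + 2M + ... + 37M = 703M
total_read = count_pass + copy_passes                  # 740 million in all
```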

A lot of system resource could be saved if this was done with a custom program that would read the input file one time to create the smaller files. The first sort step to get the total record count is probably the fastest way to get the exact count (if the exact count is truly needed). I suspect a very close approximation of the record count could be obtained by getting the number of blocks written (the tape management system should have this) and multiplying by the number of records per block.

While this would not be a sort solution, it would save a lot of extra system use that would be required to avoid writing a small amount of program code.

If i've misunderstood something, my apology.
Hope this helps,
d.sch.

Re: Splitting a very large file

Postby dick scherrer » Sun Aug 03, 2008 2:31 am

Hi Frank,

Frank Yaeger wrote:
Does your "custom program" get around the multiple output tape problem somehow?

Yes. In the past I've dealt with this in 2 ways.

One is to define multiple files with DD statements and use UNIT=AFF to force them to the same drive rather than trying to allocate many drives. One must be careful to close one file before opening another.

The other is to "dynamically allocate" each output file as it is needed, forcing each to a common drive (which might make a neat new feature ("STACK"?) for DFSORT jobs that need to split very large volumes of data). As each file is closed and a new one opened, the code assigns the new DSN.

I've gotten away from writing tape for the most part, so I've not run one for a while. I suspect that both methods would still work. If someone wanted to test, the first method is accomplished totally by JCL. The second method was done using assembler, as the new features weren't yet available in COBOL. As I said, it has been a while.

I don't know that it applies to this requirement, but we also used to allocate 2 tape drives for a single huge output file and "toggle" between them writing the output. While the "full" tape was being rewound and unloaded, writing continued on the "empty" tape saving the wait for the rewind, unload, and re-mount of a new scratch volume.

FWIW - using the sort to copy data runs so much faster than "own code", so if there is a way to use the sort and not make so many passes of the data, that would surely be the way to go.
Hope this helps,
d.sch.

Re: Splitting a very large file

Postby Frank Yaeger » Mon Aug 04, 2008 5:50 am

Dick,

I actually deleted the post you refer to because I realized I didn't know enough about COBOL tape handling to ask the question intelligently. Obviously, you saw that post before I deleted it.

In the post I deleted, I pointed out that DFSORT could write all of the output data sets in parallel with one pass over the input data set if DDs for all of the output data sets could be supplied for the step, but I didn't think that would be feasible for tapes.

I understand that the COBOL program could OPEN/CLOSE the tapes serially whereas DFSORT OPENs them all in parallel which could make a difference in the tape handling. But I'm not sure that kind of tape handling in a COBOL program still qualifies as "writing a small amount of program code". However, I understand your point about efficiency.

Note also that STARTREC/ENDREC is not the most efficient way to do this with DFSORT since that requires reading all of the records in each step. Using SKIPREC/STOPAFT would be more efficient as it would stop when the STOPAFT count was reached for each step.
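To see why this matters, here is a rough model of the records read under each approach (assumptions: 37 one-million-record chunks; with STARTREC/ENDREC every step reads the whole input, while with SKIPREC/STOPAFT step n stops after reading n million records):

```python
# Rough model of records read per approach when splitting a 37M-record
# file into 37 one-million-record chunks, one copy step per chunk.
MILLION = 1_000_000
TOTAL = 37 * MILLION
STEPS = 37

# STARTREC/ENDREC: OUTFIL selects the range, but every step still
# reads the entire input file end to end.
reads_startrec_endrec = STEPS * TOTAL

# SKIPREC/STOPAFT: step n skips (n-1) million records and stops after
# copying 1 million, so it reads only n million records in total.
reads_skiprec_stopaft = sum(n * MILLION for n in range(1, STEPS + 1))
```

Under these assumptions the STOPAFT variant reads roughly half as many records (703 million versus 1,369 million), which matches Frank's point.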

Another approach would be to use temporary disk output data sets to split up the files in one pass and then copy each disk file to a tape file. That could require reading fewer records with DFSORT than the other methods, but still might not be as good as the type of COBOL program you're talking about. Of course, DFSORT does use EXCP whereas COBOL doesn't, so there's that efficiency to take into account. I guess the only way to know if the COBOL program would be better would be to try all of the approaches and compare.
Frank Yaeger - DFSORT Development Team (IBM) - yaeger@us.ibm.com
Specialties: JOINKEYS, FINDREP, WHEN=GROUP, ICETOOL, Symbols, Migration
=> DFSORT/MVS is on the Web at http://www.ibm.com/storage/dfsort

Re: Splitting a very large file

Postby dick scherrer » Mon Aug 04, 2008 8:13 am

Hi Frank,

Frank Yaeger wrote:
But I'm not sure that kind of tape handling in a COBOL program still qualifies as "writing a small amount of program code".

Basically the program would have some number of SELECT and FD statements for the output files (loosely, these accomplish what a DCB does in assembler). N-1 of these could be cloned, as they would all be quite similar.

The program code would ("at the top") open the input and the first of the output files. The input would be read and the output written until the million records were copied. The first output would be closed and the second opened for output. This would continue until all of the input had been copied. The copy code would also be cloned.

When all of the data had been copied, the input file and the last output file would be closed.
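A minimal sketch of that one-pass split loop (Python standing in for the COBOL Dick describes; the function and the `open_output` hook are hypothetical names, not part of any actual program discussed here):

```python
# One pass over the input: open output file 1, copy a million records,
# close it, open file 2, and so on; the last (possibly short) output
# file is closed when the input is exhausted.

CHUNK = 1_000_000

def split_one_pass(records, open_output, chunk=CHUNK):
    """`records` is any iterable of records; `open_output(n)` returns a
    writable file-like object for output file number n (hypothetical hook,
    standing in for the cloned OPEN OUTPUT code)."""
    file_number = 0
    out = None
    for i, record in enumerate(records):
        if i % chunk == 0:          # time to switch output files
            if out is not None:
                out.close()         # close the full file first...
            file_number += 1
            out = open_output(file_number)  # ...then open the next
        out.write(record)
    if out is not None:
        out.close()                 # close the final, possibly short, file
    return file_number
```

The key property, as Dick notes, is that the input is read exactly once regardless of how many output files are produced.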

Frank Yaeger wrote:
Using SKIPREC/STOPAFT would be more efficient as it would stop when the STOPAFT count was reached for each step.

The 700 million records read is with reading only the records up to the million to copy each time through. If this was not done (and the entire file were passed each time), the total records read would be 37 million records * 38 passes - ugh.

If Aaron's system uses some flavor of "virtual tape", that might be a workable alternative. A separate pool of virtual volumes could be set up to support this process. I'm sure there would still need to be some negotiation with the storage management people, as there might be some reluctance to let a job use so many "drives" all at once - once upon a time I knew what a UCB "cost" but have forgotten - the good news is they can go "above the line". If it could be agreed that the copy job could use a smaller number of virtual drives (say 15), the copy could complete with only a few passes of the data and could still use the far faster I/O of the sort. Shucks, if they'd permit 45, only one copy pass to do it all :)

As you mentioned, using temporary dasd would also be attractive (provided the system can handle the spike).

For this data volume i surely would like to see the process use the sort rather than "own code".
Hope this helps,
d.sch.

Re: Splitting a very large file

Postby Aaron Chessell » Wed Aug 13, 2008 10:08 am

Hi Frank & Dick,

Sorry that I have not responded to this thread sooner.

We decided to use a NATURAL program to do the split. The reason we did that was time constraints (read: the learning curve of ICETOOL plus an impending deadline) and the fact that we are NATURAL programmers, so it was a no-brainer. The other reason was that we were only allowed access to 5 tape drives at a time, including both physical and virtual.

Reading Dick's comments above, it occurred to me that we might have been able to use the tapes "sequentially" (i.e. UNIT=AFF) for the NATURAL program; however, NATURAL has a limit of 32 files per program, so we would have had to run the job at least twice to get it all done if this method could be used.

Some good food for thought in here though.

Cheers,
Aaron
