by sergeyken » Mon Jun 26, 2023 12:56 am
One of many possible ways to do it.
The approach must be designed even before selecting the most suitable tool to implement it!
Step 1.
Create a modified copy of the source data:
- determine the size of the “meaningful” part of each record (i.e. without trailing blanks),
- re-order the records so that records with the same “meaningful size” are grouped together (see the sketch after this list).
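A minimal Python sketch of Step 1, for illustration only; the file name, line-oriented records, and blank-padded trailing columns are my assumptions, and on the mainframe this would typically be a single SORT pass:

# Step 1 sketch: trim trailing blanks, then group records by their
# "meaningful" length by sorting on that length.
def prepare(path="input.txt"):    # "input.txt" is a made-up placeholder
    with open(path) as f:
        records = [line.rstrip("\n").rstrip(" ") for line in f]
    records.sort(key=len)         # equal-length records become adjacent
    return records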
Step 2.
Create a so-called “full outer join” of the two files: all pairs of records, excluding each record joined to its own copy. (Strictly speaking, this is a cross join, i.e. a Cartesian product, rather than an outer join.)
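A sketch of Step 2 in the same spirit. As an assumption that departs from the physical two-file join described above, Python's itertools enumerates the pairs lazily instead of materializing them as a file:

from itertools import permutations

# Step 2 sketch: every ordered (left, right) pair of records taken from
# two distinct positions, i.e. no record is ever joined to its own copy.
def joined_pairs(records):
    return permutations(records, 2)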
Step 3.
Process the (huge!) joined file, eliminating those pairs where the string in the left part of the record is also a substring of its right part.
If the joined file is sorted by both left-part size and right-part size, then each pass of Step 3 may be stopped as soon as the left part becomes longer than the right part, because a longer string cannot be a substring of a shorter one (this shortcut optimizes a time-consuming process).
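A sketch of Step 3 with the length shortcut folded in. For brevity it fuses Steps 2 and 3 (pairs are generated and filtered on the fly rather than read back from a huge joined file), and it relies on the length ordering established in Step 1:

import bisect

# Step 3 sketch: reject every record whose content is a substring of some
# other record. "records" must already be sorted by length (Step 1).
def filter_substrings(records):
    lengths = [len(r) for r in records]
    survivors = []
    for i, left in enumerate(records):
        # Skip all shorter right parts: a longer string cannot be a
        # substring of a shorter one, so start at the first record
        # that is at least as long as "left".
        start = bisect.bisect_left(lengths, len(left))
        rejected = any(left in records[j]      # substring test
                       for j in range(start, len(records))
                       if j != i)              # never pair a record with itself
        if not rejected:
            survivors.append(left)
    return survivors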
Step 4.
Produce the resulting output set of records, using the left parts of all records never rejected in Step 3.
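And a sketch of Step 4, tying the earlier fragments together; the output file name is again a made-up placeholder:

# Step 4 sketch: write out all surviving records.
def main():
    records = prepare("input.txt")           # Step 1
    survivors = filter_substrings(records)   # Steps 2 and 3, fused
    with open("output.txt", "w") as f:
        for rec in survivors:
            f.write(rec + "\n")

if __name__ == "__main__":
    main()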
That’s it. Speaking for myself, I could implement it:
- using COBOL, PL/I, Assembler, C/C++, or some other compiled language,
- using REXX or another interpreted language,
- using the SORT facility, or maybe another file-processing tool (like FileAid? not sure about the details),
- and most likely there are other tools available to implement the desired algorithm.
Javas and Pythons come and go, but JCL and SORT stay forever.