|
Answer» What I have is a folder full of Protein Data Bank files (.pdb) whose contents follow this pattern:
HEADER OXIDOREDUCTASE 27-FEB-12 4DXH
TITLE HORSE LIVER ALCOHOL DEHYDROGENASE COMPLEXED WITH NAD+ AND 2,2,2-
TITLE 2 TRIFLUOROETHANOL
COMPND MOL_ID: 1;
COMPND 2 MOLECULE: ALCOHOL DEHYDROGENASE E CHAIN;
COMPND 3 CHAIN: A, B;
COMPND 4 EC: 1.1.1.1
SOURCE MOL_ID: 1;
SOURCE 2 ORGANISM_SCIENTIFIC: EQUUS CABALLUS;
SOURCE 3 ORGANISM_COMMON: DOMESTIC HORSE,EQUINE;
SOURCE 4 ORGANISM_TAXID: 9796;
SOURCE 5 STRAIN: DOMESTIC HORSE;
SOURCE 6 ORGAN: LIVER;
SOURCE 7 OTHER_DETAILS: LIVER
KEYWDS ALCOHOL DEHYDROGENASE, NAD+,TRIFLUOROETHANOL, MICHAELIS COMPLEX
KEYWDS 2 ROSSMANN FOLD, OXIDOREDUCTASE
EXPDTA X-RAY DIFFRACTION
AUTHOR B.V.PLAPP,S.RAMASWAMY
REVDAT 4 27-JUN-12 4DXH 1 JRNL
REVDAT 3 16-MAY-12 4DXH 1 JRNL
REVDAT 2 02-MAY-12 4DXH 1 TITLE
REVDAT 1 11-APR-12 4DXH 0
What I want to do is grep out the title, author, compound, release date (HEADER), and source. As you can see, SOURCE, for example, spans multiple lines. I want to build a table with one column per category listed above. I cannot figure out how to group all the SOURCE lines (and any other category that spans multiple lines) into one line.

Er... use awk.
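For what it's worth, here is a minimal sketch of what "use awk" could look like. It assumes each record sits on its own line, that continuation lines carry a number right after the record name (as in "SOURCE 2 ..."), and that the last two fields of the HEADER line are the date and the ID. The names pdb_table, flush_row and rec are made up for this example, and gluing a line that ends in "-" to the next one without a space is also an assumption, not something the question states.

```shell
# pdb_table FILE...: print one tab-separated row per PDB file.
# Hypothetical helper name; adjust the column order to taste.
pdb_table() {
    awk '
    # Print the row collected for the previous file, then reset the state.
    function flush_row() {
        if (started)
            print id, date, rec["TITLE"], rec["AUTHOR"], rec["COMPND"], rec["SOURCE"]
        id = date = ""
        split("", rec)          # portable way to empty an array
        started = 0
    }
    BEGIN {
        OFS = "\t"
        print "ID", "DATE", "TITLE", "AUTHOR", "COMPOUND", "SOURCE"
    }
    # A new input file begins: emit the previous row first.
    FNR == 1 { flush_row(); started = 1 }
    # Assumption: the HEADER line ends with "<date> <ID>".
    $1 == "HEADER" && NF >= 3 { date = $(NF - 1); id = $NF; next }
    $1 == "TITLE" || $1 == "AUTHOR" || $1 == "COMPND" || $1 == "SOURCE" {
        key = $1
        text = $0
        sub(/^[A-Z]+[ \t]*/, "", text)          # drop the record name
        sub(/[ \t]+$/, "", text)                # drop trailing padding
        if (key in rec) {
            # Continuation line: drop its leading number, then append.
            sub(/^[0-9]+[ \t]+/, "", text)
            sep = (rec[key] ~ /-$/) ? "" : " "  # assumed hyphen rule
            rec[key] = rec[key] sep text
        } else {
            rec[key] = text
        }
    }
    END { flush_row() }
    ' "$@"
}
```

Called as pdb_table *.pdb > table.tsv, this should give a header line plus one row per file, with every SOURCE (and TITLE, COMPND, AUTHOR) record folded into a single cell. Only lines after the first one of each record type have a leading number stripped, so a title that happens to start with a digit is left alone.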
|