[Omaha.pm] One-liner file clean-up



PROBLEM:

Given a file like this:

--------
A<B>01100 Metabolism</B>$
B$
B  <B>01110 Carbohydrate Metabolism</B>$
C$
C    00010 Glycolysis / Gluconeogenesis [PATH:sac00010]$
D$
D <a href="/dbget-bin/www_bget?sac:SACOL1604">SACOL1604</a> glk; glucokinase [EC:2.7.1.2]; <a href=/dbget-bin/www_bget?ko+K00845>K00845</a> glucokinase $
D <a href="/dbget-bin/www_bget?sac:SACOL0966">SACOL0966</a> pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]; <a href=/dbget-bin/www_bget?ko+K01810>K01810</a> glucose-6-phosphate isomerase $
--------

Strip out all the HTML tags, plus the leading capital letter and any spaces that follow it, so the file ends up looking like this:

--------
01100 Metabolism

01110 Carbohydrate Metabolism

00010 Glycolysis / Gluconeogenesis [PATH:sac00010]

SACOL1604 glk; glucokinase [EC:2.7.1.2]; K00845 glucokinase
SACOL0966 pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]; K01810 glucose-6-phosphate isomerase
--------


SOLUTION:

$ perl -pe 's/<.*?>//g; s/^[A-Z] *//;' filename.txt


Grin,

j