[Omaha.pm] One-liner file clean-up
PROBLEM:
Given a file like this:
--------
A<B>01100 Metabolism</B>$
B$
B <B>01110 Carbohydrate Metabolism</B>$
C$
C 00010 Glycolysis / Gluconeogenesis [PATH:sac00010]$
D$
D <a href="/dbget-bin/www_bget?sac:SACOL1604">SACOL1604</a> glk;
glucokinase [EC:2.7.1.2]; <a
href=/dbget-bin/www_bget?ko+K00845>K00845</a> glucokinase $
D <a href="/dbget-bin/www_bget?sac:SACOL0966">SACOL0966</a> pgi;
glucose-6-phosphate isomerase [EC:5.3.1.9]; <a
href=/dbget-bin/www_bget?ko+K01810>K01810</a> glucose-6-phosphate
isomerase $
--------
Strip out all the HTML tags, plus the leading capital letter and the
spaces that follow it, so it ends up looking like this:
--------
01100 Metabolism
01110 Carbohydrate Metabolism
00010 Glycolysis / Gluconeogenesis [PATH:sac00010]
SACOL1604 glk; glucokinase [EC:2.7.1.2]; K00845 glucokinase
SACOL0966 pgi; glucose-6-phosphate isomerase [EC:5.3.1.9]; K01810 glucose-6-phosphate isomerase
--------
SOLUTION:
$ perl -pe 's/<.*?>//g; s/^[A-Z] *//;' filename.txt
Grin,
j