Re: [Omaha.pm] Hello Perl Gurus

The short answer is that you are working with UTF-8 encoded files, and the first record in your file starts with 3 extra bytes (called the Byte Order Mark, or BOM) that mark the file as Unicode. However, you are not telling Perl to treat the files as UTF-8, so those bytes end up in your data. You need to either:
  1. Save your files in ASCII (which will probably break hospitals.csv since one of the hospital names contains a Unicode character).
  2. Tell Perl to read the files as UTF-8.
To do #2, you can just change the file open lines to:

open (HOSPITALS, '<:utf8', "hospitals.csv") or die $!;
open (PATIENTS, '<:utf8', "september.csv") or die $!;

Another option is to use File::BOM to add BOM detection to your script.
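
If you would rather not add a module, the BOM can also be stripped by hand. One caveat worth knowing: the UTF-8 layer decodes the 3 BOM bytes into the single character U+FEFF but does not remove it, so the first field of the first record can still carry it. Below is a minimal sketch of that approach, assuming the whole file fits in memory. The helper name read_lines_without_bom is made up for illustration and is not part of your script.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the original script): read a UTF-8 text
# file and return its lines with any leading BOM removed.
sub read_lines_without_bom {
    my ($path) = @_;

    # :encoding(UTF-8) decodes the bytes into characters and validates them.
    open(my $fh, '<:encoding(UTF-8)', $path) or die "Cannot open $path: $!";
    my @lines = <$fh>;
    close($fh);

    # After decoding, the 3-byte BOM (EF BB BF) is the single character
    # U+FEFF at the very start of the first line. Remove it if present.
    $lines[0] =~ s/^\x{FEFF}// if @lines;

    chomp @lines;
    return @lines;
}
```

With this in place, something like my @patients = read_lines_without_bom("september.csv"); should give you a first record whose first character is the first digit of the ID rather than the BOM.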

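As a positive side effect of the change suggested above, the 3-argument form of open is safer than the 2-argument form if you ever decide to use a variable in your filenames (e.g., "$month.csv"). Because the mode is passed separately, characters in the variable such as ">" or "|" cannot be misread as an open mode.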
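
To make the difference concrete, here is a small sketch. The helper open_month_csv and the idea of building the filename from a $month value are assumptions for illustration only, not something taken from your script.

```perl
use strict;
use warnings;

# Hypothetical helper: open "<month>.csv" read-only and return the
# filehandle, or undef if the file cannot be opened.
sub open_month_csv {
    my ($month) = @_;
    my $file = "$month.csv";

    # 3-arg open: the mode is fixed by the second argument, so $file is
    # always treated as a literal filename. With 2-arg open, a value such
    # as ">september" would be parsed as "open september.csv for writing"
    # and would truncate the file.
    open(my $fh, '<:encoding(UTF-8)', $file) or return;
    return $fh;
}
```

Called as open_month_csv('>september'), this simply fails to find a file literally named ">september.csv" and leaves september.csv untouched.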

Perl has excellent Unicode support (better than most), but, for whatever reason, it does not have built-in BOM detection for input files. (It does detect BOM for script files it will be executing, just not for regular input files.)


On Fri, Nov 13, 2015 at 11:33 AM Simons, Tony <ts-pm@tvortex.net> wrote:
Please excuse the test message in reply to Paul's message. I added some printf calls to the code to test the output. The result Paul is seeing occurs only on the first record, and it happens at the first occurrence of:

my $firstDir = substr ($patientId...

If I print the values of $patientId and $hospitalId before the substr, the data appears to be correct.

I also tried something else, since the data is numeric in nature. I tried:
my $patientId = int $ids[0];

which resulted in the following error, since it's not text.


So it appears to be something that's happening with the substr and the data in the file. I see no special characters in the file itself using :set list in vi. I also ran dos2unix on the file to make sure it's using the right format. I have read that there are problems with Perl and files in UTF-8 format. Is that a potential problem?
Omaha-pm mailing list
Sterling Hanenkamp