Dealing with code pages

A code page is a mapping of numeric values to characters, and it’s something you have to think about when you deal with certain parts of Windows. The Command Prompt and filenames each work with particular code pages. There’s an input code page, which handles what our programs get, and there’s an output code page, which interprets our output to display or store it.

Different distributions of Windows use different default code pages. Code Page 437, for instance, was the original IBM PC hardware code page. Localized distributions of Windows default to a code page appropriate for their language. This default is called the OEM Code Page (or just OEMCP).

You can check your code page with the chcp command:

C:\> chcp
Active code page: 437

There are also ANSI Code Pages (ACP), officially called Windows Code Pages since they aren’t actually ANSI standards. You might have run into CP-1252, which handles English and some other Western European languages.
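
To see what “mapping of values to characters” means in practice, here’s a small sketch that decodes the same octets under two code pages with the core Encode module. It doesn’t touch the console at all; it only asks Encode which Unicode code point each octet stands for in cp437 and cp1252. The octets are ones I picked for illustration, and code_point_for is just a helper name for this example:

```perl
#!perl
# same_octet.pl
use v5.10;
use strict;
use warnings;

use Encode qw(decode);

# Report the Unicode code point that one octet maps to in a code page
sub code_point_for {
	my( $code_page, $octet ) = @_;
	my $char = decode( $code_page, chr $octet );
	return sprintf 'U+%04X', ord $char;
	}

foreach my $octet ( 0x41, 0x82, 0xE9 ) {
	say sprintf '0x%02X  cp437: %s  cp1252: %s',
		$octet,
		map { code_point_for( $_, $octet ) } qw(cp437 cp1252);
	}
```

The letter A comes out the same in both, but 0x82 is é in CP 437 and a low quotation mark in CP-1252, while 0xE9 is Θ in CP 437 and é in CP-1252.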

You can change the code page from inside a Perl program. The Win32::Console module allows you to do that:

#!perl
# change_output_cp.pl
use Win32::Console;

my $CONSOLE = Win32::Console->new;
$CONSOLE->OutputCP( $ARGV[0] );

There’s an additional wrinkle. When I change the code page from within Perl, I modify the console itself, and it stays that way after my program finishes. That’s different from the Unix model, where a child process can’t change the environment of its parent.

Z:\> chcp
Active code page: 437

Z:\> perl change_output_cp.pl 65001

Z:\> chcp
Active code page: 65001

If you are fooling around with code pages, you probably want to reset the code page to whatever it was when you started.

#!perl
# restore_output_cp.pl
use Win32::Console;

my $CONSOLE = Win32::Console->new;
my $original = $CONSOLE->OutputCP;
$CONSOLE->OutputCP( $ARGV[0] );

...;

$CONSOLE->OutputCP( $original );
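
If the code between those two calls dies, the last line never runs and the console keeps the new code page. One way around that, sketched here, is a small guard object that puts the old value back when it goes out of scope. The Local::RestoreGuard package is something I made up for this example; it isn’t part of Win32::Console. I also use a plain hash to stand in for the console so the sketch runs anywhere. On Windows, the two code references would wrap $CONSOLE->OutputCP instead:

```perl
#!perl
# restore_guard.pl
use v5.10;
use strict;
use warnings;

{
	package Local::RestoreGuard;

	# $get returns the current value, $set changes it
	sub new {
		my( $class, $get, $set, $new_value ) = @_;
		my $self = bless { set => $set, original => $get->() }, $class;
		$set->( $new_value );
		return $self;
		}

	# Runs when the guard goes out of scope, even after a die
	sub DESTROY {
		my( $self ) = @_;
		$self->{set}->( $self->{original} );
		}
}

# Stand-in for the console. With Win32::Console these would be
#   sub { $CONSOLE->OutputCP }  and  sub { $CONSOLE->OutputCP( $_[0] ) }
my %fake_console = ( output_cp => 437 );
my $get = sub { $fake_console{output_cp} };
my $set = sub { $fake_console{output_cp} = $_[0] };

eval {
	my $guard = Local::RestoreGuard->new( $get, $set, 65001 );
	say "Inside: $fake_console{output_cp}";
	die "Something went wrong\n";
	};

say "After: $fake_console{output_cp}";
```

This prints 65001 inside the eval and 437 afterward, even though the block died before reaching its end.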

Another curious feature is that I can output to more than one code page in a single program and see the right things in Command Prompt:

#!perl
# show_code_pages.pl

use v5.10;
$|++;

use Win32::Console;
my $CONSOLE = Win32::Console->new;

say "Input code page is ", $CONSOLE->InputCP();
say "Output code page starts as ", 
	my $original = $CONSOLE->OutputCP();

foreach my $arg ( @ARGV ) {
	$CONSOLE->OutputCP( $arg ) or do {
		warn "Could not change to CP $arg\n$!\n$^E";
		next;
		};

	say "\nOutput code page is now ",  $CONSOLE->OutputCP();
	foreach my $point ( 0 .. 255 ) {
		my $char = chr( $point );
		$char = '.' if( $char =~ /\s/ or $point == 8 );
		print $char, ' ';
		print "\n" unless ( $point + 1 ) % 16;
		}
	}

$CONSOLE->OutputCP( $original );

Even though I output the same characters for each code page, the console displays them differently according to the current code page.

The common advice for displaying UTF-8 is to change to CP-65001, but that doesn’t quite work here. The ASCII range works, but the 8-bit characters show up as boxes.

I have to bring in some of Perl’s Unicode features. The -C switch turns on some of the things I need; -CS in particular treats the standard filehandles (STDIN, STDOUT, and STDERR) as UTF-8. When I add that, I get the right characters, but there’s some weirdness in how the lines show up: in the second half of the output, the end of the previous line is repeated.
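
As a rough guide to what -CS does for output, it’s much like adding a UTF-8 layer to the handle myself (perlrun has the full story). The sketch below does that with binmode for STDOUT only, and uses Encode to count the octets each character turns into. Every code point from 128 to 255 becomes two octets in UTF-8, which would explain the boxes: without an encoding layer, Perl sends each of those characters as a single octet, and that isn’t valid UTF-8 on its own. The utf8_length helper and the sample code points are mine, chosen only for illustration:

```perl
#!perl
# utf8_octets.pl
use v5.10;
use strict;
use warnings;

use Encode qw(encode);

# Roughly what the output part of -CS arranges for STDOUT
binmode STDOUT, ':encoding(UTF-8)';

# Count the octets in the UTF-8 encoding of one code point
sub utf8_length {
	my( $point ) = @_;
	return length( encode( 'UTF-8', chr $point ) );
	}

foreach my $point ( 0x41, 0xA9, 0xE9, 0xFF ) {
	say sprintf 'U+%04X %s takes %d octet(s)',
		$point, chr( $point ), utf8_length( $point );
	}
```

The first line reports one octet for A; the other three report two octets each.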

I suspect this is a bug in the Windows console’s handling of UTF-8, and that CP-65001 is part of the issue. It’s not just Command Prompt that has this issue; WinBash shows the same behavior. Various people have made vague references to problems with CP-65001 but never pinned down the actual cause, which I think is very deep in the architecture. Part of this is the console’s inability to use a proper font.

If I turn on autoflushed buffers with $|++, the line wrapping problem disappears, but another problem shows up. Each code number above 127 takes up two spaces.

PowerShell (and PowerShell ISE) is even more deficient in displaying these characters. WinBash doesn’t do any better.

2 Comments.

  1. I have WIN7 (ver 6.1.7601),
    Strawberry Perl (v5.18.1)

    This code provides good output in the cmd window:

    binmode STDOUT ,":unix:utf8";
    system ("chcp 65001");
    
    • That part isn’t the problem. The Command Prompt can only use a limited number of fonts. The octets can come out as UTF-8, but the terminal doesn’t have glyphs for them or can’t draw them correctly. Almost every person I’ve seen who has given this answer hasn’t tried to use anything other than basic characters.
