The ultimate guide for UTF-8 in irssi and GNU/Screen

Mar 06 2007 Published by Salvatore Iovene under Software

I’ve been having quite a lot of trouble lately configuring irssi to work well with UTF-8. Irssi’s documentation on the matter was either incomplete or discouraging, and there wasn’t much on the Internet, so, after figuring out how to do it, I’ll share it here.

First of all, you’ve got to make sure that your sys­tem is con­fig­ured for UTF-8 locales:

bash-3.1$ locale
LANG=en_GB.utf8
LANGUAGE=en_GB.utf8
LC_CTYPE="en_GB.utf8"
LC_NUMERIC="en_GB.utf8"
LC_TIME="en_GB.utf8"
LC_COLLATE="en_GB.utf8"
LC_MONETARY="en_GB.utf8"
LC_MESSAGES="en_GB.utf8"
LC_PAPER="en_GB.utf8"
LC_NAME="en_GB.utf8"
LC_ADDRESS="en_GB.utf8"
LC_TELEPHONE="en_GB.utf8"
LC_MEASUREMENT="en_GB.utf8"
LC_IDENTIFICATION="en_GB.utf8"
LC_ALL=en_GB.utf8

If the output of locale doesn’t look like that, you will want to reconfigure your locales. On Debian, what you have to do is:

sudo dpkg-reconfigure locales

Here are some screenshots of what to expect:

dpkg-1.png
dpkg-2.png
dpkg-3.png

Generating locales (this might take a while)...
  en_GB.ISO-8859-1... done
  en_GB.ISO-8859-15... done
  en_GB.UTF-8... done
  en_US.ISO-8859-1... done
  en_US.ISO-8859-15... done
  en_US.UTF-8... done
Generation complete.
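
To double-check the result without reading the whole listing by eye, the output of locale can be inspected mechanically. What follows is only a minimal sketch in Python: the helper names charset_of and non_utf8_variables are made up for this illustration, and the sample text is a shortened, partly invented variant of the listing at the top of this post.

```python
def charset_of(locale_name):
    """Return the charset part of a locale name, normalised ("UTF-8" -> "utf8")."""
    name = locale_name.strip().strip('"')
    if "." not in name:
        return ""  # e.g. "C" or "POSIX": no explicit charset
    charset = name.split(".", 1)[1]
    charset = charset.split("@", 1)[0]  # drop a modifier such as "@euro"
    return charset.replace("-", "").lower()


def non_utf8_variables(locale_output):
    """List the variables in `locale` output whose value is not a UTF-8 locale."""
    offenders = []
    for line in locale_output.splitlines():
        name, separator, value = line.partition("=")
        if not separator or not value.strip().strip('"'):
            continue  # skip malformed lines and unset variables (often LC_ALL)
        if charset_of(value) != "utf8":
            offenders.append(name.strip())
    return offenders


# Hypothetical sample: LC_TIME was deliberately changed to a non-UTF-8 locale.
sample = 'LANG=en_GB.utf8\nLC_CTYPE="en_GB.utf8"\nLC_TIME="en_GB.ISO-8859-15"\nLC_ALL='
print(non_utf8_variables(sample))  # prints ['LC_TIME']
```

An empty list means every variable that is set points at a UTF-8 locale. To check a real system, the sample string would be replaced by the actual output of the locale command.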

Perfect. Now that our system is configured for UTF-8, we want to configure our terminal emulator. If you’re using xterm, you can invoke it with the -u8 switch, or just run uxterm, and that’s all that’s needed. If you’re using gnome-terminal, go to the Terminal menu, then choose Set Character Encoding and then UTF-8. If UTF-8 doesn’t appear in the list, you may want to try logging out and logging in again. While you’re at it, in the GDM login manager, go to the Language option and choose UTF-8 there too, so that it will be the default.

Now let’s take care of GNU/Screen. In order to enable UTF-8, all you have to do is launch it with the -U switch:

screen -U -S irc

irc is just the name I want to assign to that screen session. Note that if you want to switch a running screen session to UTF-8, you can do it for each window, using the command CTRL-a : utf8 on.

Once GNU/Screen is configured for UTF-8, you finally have to set up your irssi client. This was, for me, the tricky part, since the documentation is a bit unclear, and I didn’t realize that my irssi wasn’t built with recode support. To make sure that yours is, fire it up and give the command:

/recode

If you get some­thing like

Target                         Character set

then everything is all right; otherwise, if you get a No such command error, you will have to reinstall irssi with recode support.

Irssi UTF-8 sup­port is made so that you are able to recode to dif­fer­ent charsets, depend­ing on the server or chan­nel you’re chat­ting in. First let’s set up some gen­eral options:

/set term_charset UTF-8
/set recode_autodetect_utf8 ON
/set recode_fallback UTF-8
/set recode ON
/set recode_out_default_charset UTF-8
/set recode_transliterate ON

These options will be the default, unless over­rid­den for spe­cific servers or chan­nels. What do they mean?

  • term_charset: this is the character set of your terminal emulator
  • recode_autodetect_utf8: irssi will recognize UTF-8 input automatically and treat it accordingly
  • recode_fallback: when we get some non-UTF-8 text from a chat peer, the text should be converted to this character set
  • recode: this enables the whole recode thing
  • recode_out_default_charset: this is very important: it is the default charset that you send out, unless specified differently by a server/channel rule (we will see that shortly)
  • recode_transliterate: this enables transliteration to the closest match, i.e. if someone sends you a character that’s not in your charset, it will be transliterated to the closest possible one, or replaced with a question mark if none is found

Now, you probably need different recodes on different channels, because you may speak different languages on them. For example, I send out UTF-8 when typing on English-speaking channels, and ISO-8859-1 or ISO-8859-15 when typing on Finnish- or Italian-speaking channels, so people on the other end will always get my characters right.
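
To see why the outgoing charset matters, it helps to look at the raw bytes. The following Python sketch is only an illustration and is not something irssi itself runs: it shows that the same letter becomes different bytes in UTF-8 and ISO-8859-15, what a peer sees when the two sides disagree, and a question-mark fallback similar in spirit to the one described for recode_transliterate.

```python
text = "ä"

utf8_bytes = text.encode("utf-8")          # two bytes
latin9_bytes = text.encode("iso-8859-15")  # one byte
print(utf8_bytes, latin9_bytes)            # b'\xc3\xa4' b'\xe4'

# A peer that expects ISO-8859-15 but receives UTF-8 sees two wrong
# characters instead of one right one.
garbled = utf8_bytes.decode("iso-8859-15")
print(ascii(garbled))                      # '\xc3\u20ac', i.e. Ã followed by €

# A character missing from the target charset can be replaced by "?".
# (Python's "replace" error handler stands in for irssi's fallback here.)
fallback = "€".encode("iso-8859-1", errors="replace")
print(fallback)                            # b'?'
```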

You need to add rules with the /recode command:

/recode add ircnet/foo ISO-8859-15
/recode add ircnet/bar ISO-8859-1
/recode add freenode/gee ISO-8859-1

Those commands will make you “speak” ISO-8859-15 on #foo on IRCNet, and ISO-8859-1 on #bar on IRCNet and #gee on freenode. Everywhere else you will “speak” UTF-8.

And this is what we get: here I’m typ­ing (er… I’m copy-pasting from Wikipedia) some text:

irssi.png

If you connect via SSH to a remote machine where you run irssi inside screen, all you have to do is set both systems to use UTF-8, as explained at the beginning of this article, and then set the terminal emulator of the machine from which you SSH to use UTF-8 as well.


Fixing NVIDIA driver after a xserver-xorg-core upgrade in Debian and Ubuntu

Jan 16 2007 Published by Salvatore Iovene under Software

On Debian Testing or Unstable, or a frequently upgraded version of Ubuntu, running apt-get update && apt-get upgrade will often install a slightly newer version of xserver-xorg-core, and this will break the NVIDIA proprietary drivers if you, like me, prefer to install them using the official NVIDIA installer. When this happens, X will crash at your next reboot, or the next time you start it.

Follow these instructions and you won’t need to reinstall the NVIDIA driver from scratch each time. First of all, stop your login manager (gdm assumed here):

sudo /etc/init.d/gdm stop

Then move to:

cd /usr/lib/xorg/modules/extensions

Nor­mally it should look like this:

total 956K
1 root root  19K 2007-01-09 21:13 libdbe.so
1 root root  34K 2007-01-09 21:13 libdri.so
1 root root 145K 2007-01-09 21:13 libextmod.so
1 root root   18 2007-01-15 20:42 libglx.so -> libglx.so.1.0.9742
1 root root 676K 2007-01-15 20:42 libglx.so.1.0.9742
1 root root  28K 2007-01-09 21:13 librecord.so
1 root root  38K 2007-01-09 21:13 libxtrap.so

Notice the symbolic link from libglx.so to libglx.so.1.0.9742. In your case, instead, the installation of a newer xserver-xorg-core has overwritten libglx.so with the stock one provided by the X server. What you have to do is simply restore the previous situation. Remove the libglx.so file:

sudo rm libglx.so

And make the sym­bolic link again:

sudo ln -s libglx.so.1.0.9742 libglx.so

Of course the version number, 1.0.9742 in the listing above, may be different in your case. Now you can simply start the gdm login manager again:

sudo /etc/init.d/gdm start

Every­thing should be work­ing again.
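
The two manual steps (remove libglx.so, then link it to the versioned NVIDIA library) could also be scripted. Below is a hedged Python sketch, not a tested fix for a real system: the function name restore_libglx_link is invented here, and picking the last name in sorted order as the NVIDIA library is an assumption that only holds when a single versioned file is present. The demo runs in a temporary directory rather than in /usr/lib/xorg/modules/extensions.

```python
import os
import tempfile


def restore_libglx_link(directory, link_name="libglx.so"):
    """Point link_name at the versioned library (e.g. libglx.so.1.0.9742)."""
    candidates = sorted(
        name for name in os.listdir(directory)
        if name.startswith(link_name + ".")
    )
    if not candidates:
        raise FileNotFoundError("no versioned " + link_name + " in " + directory)
    target = candidates[-1]  # assumption: only one versioned file exists
    link_path = os.path.join(directory, link_name)
    if os.path.lexists(link_path):
        os.remove(link_path)       # same as: rm libglx.so
    os.symlink(target, link_path)  # same as: ln -s <target> libglx.so
    return target


# Demo in a scratch directory, mimicking the broken state after an upgrade.
with tempfile.TemporaryDirectory() as demo_dir:
    open(os.path.join(demo_dir, "libglx.so.1.0.9742"), "w").close()
    open(os.path.join(demo_dir, "libglx.so"), "w").close()  # the stock file
    restore_libglx_link(demo_dir)
    print(os.readlink(os.path.join(demo_dir, "libglx.so")))  # libglx.so.1.0.9742
```

On a real system the function would have to run as root and be pointed at the extensions directory, with the login manager stopped as described above.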

Thanks to http://osrevolution.wordpress.com/ for this.


Is your stacktrace really corrupted?

Oct 17 2006 Published by Salvatore Iovene under Software

During your debugging sessions you may encounter the ‘stack corruption’ problem. Usually you will find out about it after seeing your program run into a segmentation fault. Otherwise, it may mean that some very malicious and subtle code has been injected into your program, usually through a buffer overrun. What is a buffer overrun? Let’s examine the following short C code:

#include <stdio.h>
#include <string.h> /* declares strcpy() */
void bar(char* str) {
    char buf[4];
    strcpy( buf, str );
}

void foo() {
    printf("Hello from foo!");
}

int main(void) {
    bar("This string definitely is too long, sorry!");
    foo();
    return 0;
}

There’s clearly something wrong with it: as you can see, we are copying ‘str’ to ‘buf’ without first checking the size of ‘str’. First of all, this is a security issue: if ‘str’ didn’t come from a fixed string as it does here, but from external input (maybe a form on a website), an attacker could supply a string crafted to overwrite the return address stored on the stack and make the program jump to malicious code instead of returning to main(). What we have here, anyhow, is just a segmentation fault. Let’s debug the program.


(gdb) file stack
Reading symbols from /home/siovene/stack...done.
(gdb) run
Starting program: /home/siovene/stack

Program received signal SIGSEGV, Segmentation fault.
0x6f6c206f in ?? ()
(gdb) backtrace
#0  0x6f6c206f in ?? ()
#1  0x202c676e in ?? ()
#2  0x72726f73 in ?? ()
#3  0xbf002179 in ?? ()
#4  0xb7df9970 in __libc_start_main ()
      from /lib/tls/i686/cmov/libc.so.6
Previous frame inner to this frame (corrupt stack?)

Obviously something must have gone wrong. In order to better understand what is going on, let’s take a step back and examine a working example instead:


#include <stdio.h>
#include <string.h> /* declares strcpy() */
void bar(char* str) {
    char buf[4];
    strcpy( buf, str );
}

void foo() {
    printf("Hello from foo!");
}

int main(void) {
    bar("abc");
    foo();
    return 0;
}

This is the same code, but the long string that caused the segmentation fault has been replaced by a harmless three-character string: ‘abc’. Let’s name the program stack.c and compile it with debug information:


$> gcc -g -o stack stack.c

Now let’s debug it:


(gdb) file stack
Reading symbols from /home/siovene/stack...done.
(gdb) break bar
Breakpoint 1 at 0x80483ca: file stack.c, line 5.
(gdb) run
Starting program: /home/siovene/stack

Breakpoint 1, bar (str=0x8048545 "abc") at stack.c:5
5         strcpy( buf, str );

We have entered the bar() func­tion, let’s exam­ine the backtrace:


(gdb) backtrace
#0  bar (str=0x8048545 "abc") at stack.c:5
#1  0x0804840e in main () at stack.c:13

What is the address of the bar() function?


(gdb) print bar
$1 = {void (char *)} 0x80483c4

Let’s now be paranoid and check this by producing a dump of our executable:


$> objdump -tD stack > stack.dis

Open the file with your favorite editor and look for ‘80483c4’, the address of bar():


080483c4 <bar>:
 80483c4: 55                    push   %ebp
 80483c5: 89 e5                 mov    %esp,%ebp
 80483c7: 83 ec 28              sub    $0x28,%esp
 80483ca: 8b 45 08              mov    0x8(%ebp),%eax
 80483cd: 89 44 24 04           mov    %eax,0x4(%esp)
 80483d1: 8d 45 e8              lea    0xffffffe8(%ebp),%eax
 80483d4: 89 04 24              mov    %eax,(%esp)
 80483d7: e8 0c ff ff ff        call   80482e8
 80483dc: c9                    leave
 80483dd: c3                    ret

Per­fect, that’s our func­tion. But now let’s get curi­ous. Where’s the stack pointer in the CPU registers?


(gdb) info registers
eax            0x0      0
ecx            0xb7ed11b4       -1209200204
edx            0xbff04f60       -1074770080
ebx            0xb7ecfe9c       -1209205092
esp            0xbff04f10       0xbff04f10
ebp            0xbff04f38       0xbff04f38
esi            0xbff04fd4       -1074769964
edi            0xbff04fdc       -1074769956
eip            0x80483ca        0x80483ca
eflags         0x282    642
cs             0x73     115
ss             0x7b     123
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x33     51

The ‘esp’ register, on the architecture this article is written on, is the stack pointer. Its value is 0xbff04f10. Let’s examine the memory at that address:


(gdb) x/20xw 0xbff04f10
0xbff04f10:  0x00000000   0x08049638   0xbff04f28   0x080482b5
0xbff04f20:  0xb7ecfe90   0xbff04f34   0xbff04f48   0x0804843b
0xbff04f30:  0xbff04fdc   0xb7ecfe9c   0xbff04f48   0x0804840e
0xbff04f40:  0x08048545   0x08048480   0xbff04fa8   0xb7db3970
0xbff04f50:  0x00000001   0xbff04fd4   0xbff04fdc   0x00000000

With this command we have told GDB to examine 20 words in hexadecimal format starting at address 0xbff04f10, the value of the stack pointer. On this architecture each stack frame holds a back-chain pointer to the previous frame, stored at the address held in ‘ebp’ and immediately followed by the return address. At 0xbff04f38 (the value of ‘ebp’) we find 0xbff04f48, the frame of the caller, followed by 0x0804840e, which is the address GDB reported for main() in the backtrace. Following the chain further eventually leads to a back-chain pointer of 0x00000000, which marks the outermost frame at the program entry point. This agrees with the fact that we know bar() was called by main()!
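
The chain of frames can also be followed mechanically from a memory dump. The Python sketch below is an illustration only: it assumes the conventional x86 layout in which the word at the address in ‘ebp’ is the caller’s saved ‘ebp’ and the next word is the return address, and it uses the words of the dump above plus the ‘ebp’ value from info registers. The names DUMP_BASE, DUMP_WORDS and walk_frames are invented for this example.

```python
DUMP_BASE = 0xbff04f10
DUMP_WORDS = [
    0x00000000, 0x08049638, 0xbff04f28, 0x080482b5,
    0xb7ecfe90, 0xbff04f34, 0xbff04f48, 0x0804843b,
    0xbff04fdc, 0xb7ecfe9c, 0xbff04f48, 0x0804840e,
    0x08048545, 0x08048480, 0xbff04fa8, 0xb7db3970,
    0x00000001, 0xbff04fd4, 0xbff04fdc, 0x00000000,
]
# Map each word's address to its value (32-bit words, 4 bytes apart).
MEMORY = {DUMP_BASE + 4 * i: word for i, word in enumerate(DUMP_WORDS)}


def walk_frames(memory, frame_pointer):
    """Yield (frame_pointer, return_address) pairs along the saved-ebp chain."""
    while frame_pointer in memory and frame_pointer + 4 in memory:
        yield frame_pointer, memory[frame_pointer + 4]
        frame_pointer = memory[frame_pointer]  # caller's saved ebp


for frame, return_address in walk_frames(MEMORY, 0xbff04f38):
    print(hex(frame), "returns to", hex(return_address))
# 0xbff04f38 returns to 0x804840e
# 0xbff04f48 returns to 0xb7db3970
```

The first return address is the one GDB attributes to main() in frame #1. The walk stops as soon as the chain leaves the 20 dumped words; with a larger dump it would continue until the saved frame pointer is zero.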

Everything looks OK and in place. Since the program works perfectly, we weren’t expecting anything different. Let’s now do the same with the faulty program. At the moment of the segmentation fault, the backtrace looked like this:


(gdb) backtrace
#0  0x6f6c206f in ?? ()
#1  0x202c676e in ?? ()
#2  0x72726f73 in ?? ()
#3  0xbf002179 in ?? ()
#4  0xb7df9970 in __libc_start_main ()
      from /lib/tls/i686/cmov/libc.so.6
Previous frame inner to this frame (corrupt stack?)

To see exactly what goes on, it would be bet­ter to debug it more carefully:


(gdb) file stack
Reading symbols from /home/siovene/stack...done.
(gdb) break bar
Breakpoint 1 at 0x80483ca: file stack.c, line 5.
(gdb) run
Starting program: /home/siovene/stack

Breakpoint 1, bar (str=0x8048580
                    "This string definitely is too long, sorry!")
                  at stack.c:5
5         strcpy( buf, str );
(gdb) next
6       }
(gdb) next
0x6f6c206f in ?? ()
(gdb) next
Cannot find bounds of current function

Let’s then try to fol­low back the stack­trace, as we did previously:


(gdb) backtrace
#0  0x6f6c206f in ?? ()
#1  0x202c676e in ?? ()
#2  0x72726f73 in ?? ()
#3  0xbf002179 in ?? ()
#4  0xb7e9b970 in __libc_start_main ()
      from /lib/tls/i686/cmov/libc.so.6
Previous frame inner to this frame (corrupt stack?)

(gdb) info registers
eax            0xbfeed1e0       -1074867744
ecx            0xb7ea4c5f       -1209381793
edx            0x80485ab        134514091
ebx            0xb7fb7e9c       -1208254820
esp            0xbfeed200       0xbfeed200
ebp            0x6f742073       0x6f742073
esi            0xbfeed294       -1074867564
edi            0xbfeed29c       -1074867556
eip            0x6f6c206f       0x6f6c206f
eflags         0x246    582
cs             0x73     115
ss             0x7b     123
ds             0x7b     123
es             0x7b     123
fs             0x0      0
gs             0x33     51

(gdb) x/20xw 0xbfeed200
0xbfeed200:  0x202c676e   0x72726f73   0xbf002179   0xb7e9b970
0xbfeed210:  0x00000001   0xbfeed294   0xbfeed29c   0x00000000
0xbfeed220:  0xb7fb7e9c   0xb7fee540   0x08048480   0xbfeed268
0xbfeed230:  0xbfeed210   0xb7e9b932   0x00000000   0x00000000
0xbfeed240:  0x00000000   0xb7feeca0   0x00000001   0x08048300

(gdb) x/20xw 0x202c676e
0x202c676e:     Cannot access memory at address 0x202c676e
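
The values in this backtrace are less random than they look. As a hedged illustration in Python (the 0x18 offset is read off the lea 0xffffffe8(%ebp),%eax instruction in the disassembly of bar(), and the helper names are invented), the sketch below lays the long string over the stack starting at buf and prints the 32-bit little-endian words that land on the saved ‘ebp’, the return address and the slots after it.

```python
import struct

PAYLOAD = b"This string definitely is too long, sorry!"
BUF_BELOW_EBP = 0x18  # buf starts 0x18 bytes below the saved ebp


def spilled_words(payload, buf_offset):
    """Return the complete little-endian words written at and beyond saved ebp."""
    spill = payload[buf_offset:]
    usable = len(spill) - len(spill) % 4  # ignore a trailing partial word
    return [word for (word,) in struct.iter_unpack("<I", spill[:usable])]


def word_to_text(word):
    """Show the four bytes of a word as the characters they came from."""
    return struct.pack("<I", word).decode("ascii", errors="replace")


labels = ["saved ebp", "return address", "next word", "next word"]
for label, word in zip(labels, spilled_words(PAYLOAD, BUF_BELOW_EBP)):
    print(f"{label:15} 0x{word:08x} {word_to_text(word)!r}")
# saved ebp       0x6f742073 's to'
# return address  0x6f6c206f 'o lo'
# next word       0x202c676e 'ng, '
# next word       0x72726f73 'sorr'
```

These are the values GDB shows in ‘ebp’ and ‘eip’ and in frames #1 and #2 of the backtrace: pieces of the string itself, read as if they were addresses.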

There’s only one explanation for that: the stack memory has been overwritten and now contains gibberish. We have been very unlucky with our example, but it gave us the tools to imagine another case. Let’s assume the stack trace looked corrupted not because the stack was actually overwritten, but because GDB was failing to build the trace. In this case you are still able to navigate it backwards. All you need to do is keep following the chain of stack frames, starting from the ‘esp’ and ‘ebp’ registers, until you reach 0x00000000. Write all the addresses down, and then use ‘objdump’ to obtain the disassembly and symbol information from the binary. All that is left, now, is to check the names of the symbols matching the addresses you wrote down.
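
Matching the collected addresses against the symbols can be automated as well. The following Python sketch is an assumption-laden illustration: it only understands label lines of the form shown in the disassembly earlier (such as 080483c4 <bar>:), the start addresses given for foo and main are hypothetical placeholders rather than values from the real binary, and the helper names are invented.

```python
import bisect
import re

# Label lines as printed by objdump -D; only the bar entry comes from the
# listing in this article, the foo and main addresses are hypothetical.
DISASSEMBLY = """\
080483c4 <bar>:
080483de <foo>:
080483f2 <main>:
"""

LABEL = re.compile(r"^([0-9a-f]+) <([^>]+)>:$")


def parse_symbols(text):
    """Return a sorted list of (start_address, name) taken from label lines."""
    symbols = []
    for line in text.splitlines():
        match = LABEL.match(line.strip())
        if match:
            symbols.append((int(match.group(1), 16), match.group(2)))
    return sorted(symbols)


def symbol_for(symbols, address):
    """Return the name of the last symbol starting at or before address."""
    starts = [start for start, _ in symbols]
    index = bisect.bisect_right(starts, address) - 1
    return symbols[index][1] if index >= 0 else None


symbols = parse_symbols(DISASSEMBLY)
print(symbol_for(symbols, 0x80483ca))  # bar
```

The lookup returns the nearest symbol that starts at or before the address. It has no size information, so an address far past the last symbol would still be attributed to it; treat the result as a hint to verify against the disassembly.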

If you can actually do that, then you have successfully reconstructed your stack trace. It wasn’t really corrupted by a bug in your program; GDB simply failed to keep up with it.
