By default, Boost error messages when using Visual Studio are ANSI encoded and not UTF-8. To force Boost to emit strings in UTF-8, the #define BOOST_SYSTEM_USE_UTF8
is needed. It could easily be added to Development/cmake/NmosCppCommon.cmake after line 147 like this:
add_definitions(/DBOOST_SYSTEM_USE_UTF8)
Unfortunately the console also needs a hint that printed text is UTF-8. I had to fiddle a little bit, but finally I found a portable solution:
#include <clocale>
#include <string>

static bool make_utf8()
{
    // setlocale may return nullptr, so check before constructing a std::string
    const char* current = std::setlocale(LC_CTYPE, "");
    if (current != nullptr) {
        std::string locale = current;
        auto dot = locale.find_last_of('.');
        if (dot != std::string::npos) {
            locale = locale.substr(0, dot) + ".utf8";
            char* result = std::setlocale(LC_CTYPE, locale.c_str());
            return result != nullptr;
        }
    }
    return false;
}
After initialising the log:
if (!make_utf8())
{
slog::log<slog::severities::error>(gate, SLOG_FLF) << "Setting console to UTF-8 mode failed";
}
Tested under Windows and Linux.
aholzinger changed the title from "Logging Boost error messages as UTF8" to "Logging Boost error messages as UTF-8" on Jan 16, 2020.
@aholzinger, thanks for the thorough investigation! Please would you give some examples of what happened before and after you made this change?
A couple of questions:
- Presumably we need to also build the Boost libraries (those which aren’t header-only) and the C++ REST SDK with BOOST_SYSTEM_USE_UTF8?
- Could we achieve the same effect for logging via a call to e.g. std::cerr.imbue(...) (getting its current locale first if required) rather than setting the global locale with setlocale?
- Good question: I really don’t know, but I guess it would make sense. On my dev machine I did not compile Boost with BOOST_SYSTEM_USE_UTF8 and the ASIO error messages are correct ("… ungültig …"), but I can’t imagine that compiled Boost code can emit correct UTF-8 messages if Boost is not compiled with BOOST_SYSTEM_USE_UTF8.
- I’m not an expert. I did some brief research and I doubt that it can be done via imbue, but what about this?
https://www.boost.org/doc/libs/master/index.html
Well, it needs the upcoming Boost version 1.73.
Hi Axel,
Sorry, did you mean to put in a more specific link in 2?
For 1, I think it may require std::ios_base::sync_with_stdio(false) before any I/O or calling std::cerr.imbue(...), but then I think it should work… (e.g. see the answer to a similar question on SO, though it’s possible the Windows Console needs something different…)
It sounds like defining BOOST_SYSTEM_USE_UTF8
is a good idea, but should be done consistently through whole application code, not just nmos-cpp.
Have you tried adding /DBOOST_SYSTEM_USE_UTF8 to the CMake variable CMAKE_CXX_FLAGS? The CMake GUI makes this easy, or the default value can be overridden on the command line with something like -DCMAKE_CXX_FLAGS:STRING="/DBOOST_SYSTEM_USE_UTF8 /DWIN32 /D_WINDOWS /W3 /GR /EHsc" (so as to keep the default symbol definitions and flags, though we forcibly increase the warning level for nmos-cpp).
I have followed Boost.Nowide development, but I’d prefer not to require the latest Boost in the library itself; maybe it could be optionally used in the nmos-cpp-node example app. Of course, the setlocale solution, or the imbue solution if that works, also needs to be done at application, not library, level. This could also be demonstrated in the nmos-cpp-node example app.
Yes, BOOST_SYSTEM_USE_UTF8
needs to be set everywhere.
Yes, setting BOOST_SYSTEM_USE_UTF8 should work, but I think it should eventually go into CMakeLists.txt (or one of the .cmake files).
I think for the nmos-cpp lib itself UTF-8 is not an issue, because the issue only comes up if the text is sent to a console, and then it’s the responsibility of the console application to handle it. So yes, handling this in the example apps is fine.
Yes, setting BOOST_SYSTEM_USE_UTF8 should work, but I think it should eventually go into CMakeLists.txt (or one of the .cmake files).
If it is possible some users may not want this behaviour, we need to keep some way to turn it on and off. So maybe just documenting how to set it on the CMake command line is enough?
I think for the nmos-cpp lib itself UTF-8 is not an issue, because the issue only comes up if the text is sent to a console, and then it’s the responsibility of the console application to handle it. So yes, handling this in the example apps is fine.
To decide this, we also have to consider whether exceptions and messages from other libraries will also use UTF-8. Generally this is probably a safe assumption, but I am not certain yet.
If it is possible some users may not want this behaviour, we need to keep some way to turn it on and off. So maybe just documenting how to set it on the CMake command line is enough?
What would they lose?
On Linux the #define does nothing.
On Windows, those with a US/Latin codepage won’t notice a difference, will they?
On Windows, those with a codepage other than US/Latin won’t have garbled error messages anymore ==> improvement.
So why not define it generally?
To decide this, we also have to consider whether exceptions and messages from other libraries will also use UTF-8. Generally this is probably a safe assumption, but I am not certain yet.
In general, portable projects that I know use UTF-8 throughout, and the Windows port has to handle this. This puts a burden on the Windows port, but unfortunately Microsoft opted for UTF-16LE back then and since then hasn’t done its homework to add a UTF-8 mode (adding something like #define UNICODE_UTF8 to tchar.h). With something like #define UNICODE_UTF8, all this hassle wouldn’t exist and Windows would take care of converting back and forth between UTF-8 and UTF-16LE, as it already does between ANSI and UTF-16LE when you choose the A-functions (if you don’t #define UNICODE, the macros expand to the A-functions, e.g. CreateFile expands to CreateFileA).
Long story short, I think defining BOOST_SYSTEM_USE_UTF8 will only improve the situation. Anyone who doesn’t want UTF-8 (on Windows) has to do the same as with any other portable library: convert to/from ANSI or UTF-16LE (depending on whether they use #define UNICODE or not).
Thanks, Axel, good points, and I agree that the direction of travel even for Windows libraries is char with UTF-8. I often work in the narrow world of ASCII, where these issues don’t arise, but of course I have to handle Japanese deployment/integration regularly as well, so I’ll use that as a test environment to confirm.
Switching the console to UTF-8 in a portable way (tested on Linux and Windows):
#include <clocale>

int main(int argc, char* argv[])
{
    // set the console locale to UTF-8, because Boost emits
    // UTF-8 error messages when BOOST_SYSTEM_USE_UTF8 is defined
#ifdef BOOST_SYSTEM_USE_UTF8
    std::setlocale(LC_CTYPE, ".UTF-8");
#endif
    ...
This could also be moved to a static function somewhere in the Boost helpers to avoid the ugly #ifdef in main.
Which versions of Windows and Visual Studio did you test on? I have to consider several generations of compiler/standard libraries and OS unfortunately.
Sorry, overlooked that.
Windows 10 x64
Visual Studio 2019
I could test on Visual Studio 2017 also.
Different Windows versions will be hard. I currently have no 8.1, 8 or 7 running anymore.
This will have to wait until I have time to fully test on multiple Linux, Windows and macOS platforms.
No problem, take your time.
My question is similar to, yet different from: How to fix «Package inputenc Error: Unicode char u8:� not set up for use with LaTeX.»?
I’m using TexLive and Texmaker with the elsarticle.cls document class. Bibtex is being used for the references as required by the document class. Because several of the references are in Russian the following is in the preamble:
\usepackage[T1,T2A]{fontenc}% Russian language support
\usepackage[utf8]{inputenc}
%% `Elsevier LaTeX' style
\bibliographystyle{elsarticle-num}
Both the .tex file and the .bib files are in UTF-8 encoding (checked using the command file -I (filename) in the terminal on a Mac). For some reason, when the document is compiled, the .bbl file is in an unknown encoding. Attempting to declare the Unicode characters
\DeclareUnicodeCharacter{2013}{\textendash}
has no effect on the error message being thrown:
! Package inputenc Error: Unicode char u8:� not set up for use with LaTeX.

See the inputenc package documentation for explanation.
Type H <return> for immediate help.
... �.
The line numbers of the error match the lines of the .bbl file generated.
The other clues are that the errors do not happen until the document is compiled with pdflatex, bibtex, and then pdflatex again. Further, if the .bib file with the Russian Cyrillic is commented out, the error does not happen. The last clue: in the terminal on my Mac, file -I (filename) shows that the .bbl file is in utf8 when using another document that is in the report document class and uses:
\bibliographystyle{unsrt}
Does anyone have any suggestions on troubleshooting this or a potential fix?
To make an example for yourself, simply download the elsarticle template from here:
and add this to the bib file:
@book{aizenshtata_features_1952,
address = {Ленинград},
series = {Главное управление гидрометеорологической службы при Совете Министров СССР. Труды Ташкентской геофизической обсерватории Вып. 6 (7)},
title = {Some features of the heat balance of the sandy desert (in {Russian}: Некоторые черты теплового баланса песчаной пустыни)},
abstract = {На обл. только загл. серии, Библиогр.: с. 54-55 (36 назв.)},
language = {rus},
publisher = {Гидрометеоиздат},
author = {Айзенштат, Борис Абрамович},
collaborator = {Зуев, М. В.},
year = {1952},
note = {pgs 3-55},
keywords = {Пустыни песчаные -- Тепловой баланс},
file = {3122870.pdf:/Users/nkampy/Library/Application Support/Zotero/Profiles/f12p038d.default/zotero/storage/PRG79WIE/3122870.pdf:application/pdf}
}
@book{kondratyev_transfer_1950,
address = {Москва; Ленинград},
title = {Transfer of long-wave radiation in the atmosphere (in Russian: Перенос длинноволнового излучения в атмосфере)},
abstract = {Библиогр.: с. 285-287 (71 назв.)},
language = {rus},
publisher = {Госизд-во техн-теоретлит},
author = {Кондратьев, Кирилл Яковлевич},
year = {1950},
keywords = {Атмосфера -- Лучистый теплообмен -- Определение, Тепловое излучение, длинноволновое -- Перенос и поглощение в атмосфере}
}
@inproceedings{gordov__calculation_1938,
address = {Ленинград},
title = {Calculation of direct solar radiation on inclined surfaces differently oriented on the latitude 42{\degree} (in {R}ussian: Расчет прямой солнечной радиации на различно ориентированные наклонные поверхности для широты 42{\degree})},
booktitle = {Материалы по агроклиматическому районированию субтропиков СССР},
publisher = {Гидрометеорологическое изд-во},
author = {{Гордов А.Н.}},
year = {1938},
file = {ER16-04761.pdf:/Users/nkampy/Library/Application Support/Zotero/Profiles/f12p038d.default/zotero/storage/TUTQK3QF/ER16-04761.pdf:application/pdf}
}
And add this to the .tex file:
%%%%%% Packages Added By Author %%%%%%
\usepackage[T1,T2A]{fontenc}% Russian language support
\usepackage[utf8]{inputenc}
%\DeclareUnicodeCharacter{2013}{\textendash}
%\DeclareUnicodeCharacter{00A0}{ }
%\DeclareUnicodeCharacter{00A0}{~}
\usepackage[russian,USenglish]{babel}
\usepackage{gensymb}% \degree command
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
and this in the main body after the template citations:
Test references: \cite{aizenshtata_features_1952, kondratyev_transfer_1950, gordov__calculation_1938}
You can see for yourself that the problem is with how the elsarticle article class handles conversion of Cyrillic letters in the names to initials for the bibliography. You can also see that unicode characters are being handled correctly otherwise.
If you’ve landed here it means you’ve been hit by this message in your program. In this post I’ll quickly introduce you to what «UTF-8 byte sequences» are, why they can be invalid and how to solve this problem in Ruby.
Short introduction to UTF-8 and other encodings
UTF-8 is, as explained on Wikipedia, an encoding of Unicode code points (in simple words: numbers representing characters). Every character in UTF-8 is encoded as a sequence of 1 to 4 bytes.
Apart from UTF-8 there are also other encodings like ISO-8859-1 or Windows-1252 — you may have seen these names before in your programming career. These encodings cover a large set of characters, including special Latin characters etc.
Now, even though UTF-8 covers a huge set of characters as well, it is not 100% compatible with the above-mentioned encodings. Take a look at the following comparison:
- Both UTF-8 and ISO-8859-1 are ASCII compatible — they use the same codes for digits and the Latin alphabet
- UTF-8 includes characters not present in ISO-8859-1, like the rocket emoji 🚀
- Both UTF-8 and ISO-8859-1 include the «Å» character, but it is encoded using different bytes — c385 in UTF-8 and c5 in ISO-8859-1
[Figure: Encodings compatibility]
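The «Å» case from the comparison above is easy to verify in Ruby (a quick sketch using the standard String#bytes and String#encode methods):

```ruby
# "Å" in Ruby source code is UTF-8 encoded by default: two bytes
utf8 = "Å"
p utf8.bytes  # => [195, 133], i.e. 0xC3 0x85

# Transcoded to ISO-8859-1, the same character is a single 0xC5 byte
latin1 = utf8.encode("ISO-8859-1")
p latin1.bytes  # => [197], i.e. 0xC5
```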
Why does an «invalid byte sequence in UTF-8» error happen?
Ruby’s default encoding since 2.0 is UTF-8. This means that Ruby will treat any string you input as a UTF-8 encoded string unless you tell it explicitly that it’s encoded differently.
Let’s use the Å
character from the introductory diagram to present this problem.
Imagine you have a file file.txt containing the following string: "vandflyver \xC5rhus". As you already know, the byte C5 corresponds to Å in ISO-8859-1 and is not a valid byte sequence in UTF-8. Ruby, however, doesn’t know that the original encoding of the file is ISO-8859-1 and will by default interpret it as UTF-8.
So, the following operation will result in the infamous «UTF-8 Invalid byte sequence»:
irb(main):079:0> File.write("file.txt", "vandflyver \xC5rhus")
=> 16
irb(main):080:0> open("file.txt", "r") { |io| io.read.split }
Traceback (most recent call last):
7: from /Users/bajena/.rbenv/versions/2.6.1/bin/irb:23:in `<main>'
6: from /Users/bajena/.rbenv/versions/2.6.1/bin/irb:23:in `load'
5: from /Users/bajena/.rbenv/versions/2.6.1/lib/ruby/gems/2.6.0/gems/irb-1.0.0/exe/irb:11:in `<top (required)>'
4: from (irb):80
3: from (irb):80:in `open'
2: from (irb):80:in `block in irb_binding'
1: from (irb):80:in `split'
ArgumentError (invalid byte sequence in UTF-8)
The «invalid byte sequence in UTF-8» here is our «Å» (C5) byte, as it’s not valid in UTF-8. Fortunately there are a few ways to solve this problem.
Solution 1 — Provide a source encoding
If you know the encoding in which the file was originally written then all you have to do is to provide the encoding name when reading the input file. Ruby will automatically handle the character conversion for you:
irb(main):098:0> s = open("file.txt", "r:ISO-8859-1:UTF-8") { |io| io.read.split }
=> ["vandflyver", "Århus"]
irb(main):099:0> s[1][0].unpack("H*")
=> ["c385"]
In the last line I’ve used the String#unpack method to print the converted character’s bytes. As you can see, it got correctly converted from C5 to C385 🎉
Solution 2 — the String#encode method
In many cases you won’t be lucky enough to know the original encoding of the file. In that case the String#encode method comes in handy. You can use it to skip invalid UTF-8 characters or replace them with a string of your choice.
Check out the following examples:
irb(main):102:0> open("file.txt", "r") { |io| io.read.encode("UTF-8", invalid: :replace) }
=> "vandflyver �rhus"
irb(main):103:0> open("file.txt", "r") { |io| io.read.encode("UTF-8", invalid: :replace, replace: "") }
=> "vandflyver rhus"
It may not be beautiful, but it’s still better than crashing the app, right?
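As a side note, since Ruby 2.1 the standard String#scrub method is a shortcut for exactly this replace-or-drop pattern on strings already tagged with the target encoding:

```ruby
# an invalid UTF-8 string, like the one read from file.txt
s = "vandflyver \xC5rhus".b.force_encoding("UTF-8")

# replace invalid bytes with the Unicode replacement character...
p s.scrub  # => "vandflyver �rhus"

# ...or drop them entirely
p s.scrub("")  # => "vandflyver rhus"
```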
Solution 3 — Detect source encoding
In case you don’t know the source encoding and don’t want to drop the invalid characters, you can use a character-encoding detection gem called charlock_holmes.
It’ll analyze the string and give you the most probable source encoding together with a confidence value (plus a language code as a bonus :P).
Check it out in action:
irb(main):015:0> s = open("file.txt", "r") { |io| io.read }
=> "vandflyver xC5rhus"
irb(main):016:0> d = CharlockHolmes::EncodingDetector.detect(s)
=> {:type=>:text, :encoding=>"ISO-8859-1", :ruby_encoding=>"ISO-8859-1", :confidence=>70, :language=>"nl"}
irb(main):017:0> s.encode("UTF-8", d[:encoding], invalid: :replace, replace: "")
=> "vandflyver Århus"
Summary
First of all, I hope this post helped you solve the Ruby issue you had. I’m also sure you’ve learned something useful along the way. String encodings can sometimes be really f***ed up, so it’s worth knowing what’s going on under the hood.