Globalization Features in Whidbey’s CLR
Michael Kaplan
Technical Lead, Globalization Infrastructure, Fonts and Tools
Microsoft Windows International Division
http://blogs.msdn.com/michkap
April 25, 2005
Customized Cultures and Regions
CultureAndRegionInfoBuilder class
Create an override to an existing culture
Create based on an existing culture
Create from scratch
Must be an administrator to register
Can register the file on multiple machines
CultureAndRegionInfoBuilder sample
CultureAndRegionInfoBuilder carib =
    new CultureAndRegionInfoBuilder("de-DE-MineMine", CultureAndRegionModifiers.None);

// Load up all of the existing data for German and for Germany
carib.LoadDataFromCultureInfo(new CultureInfo("de-DE", false));
carib.LoadDataFromRegionInfo(new RegionInfo("DE"));

// Change a property
carib.ThreeLetterISORegionName = "ZZZ";

// Register the culture on the machine (requires administrator rights)
carib.Register();

// Use the new culture
CultureInfo ci = new CultureInfo("de-DE-MineMine");
CaRIB serialization with LDML
Locale Data Markup Language
  Described in UTS#35 at http://unicode.org/reports/tr35/
CaRIB objects can be saved as LDML files
Data can be loaded from LDML files
CaRIB will do its best with files it did not create
LDML Sample
// Get a unique temp path; delete the placeholder so Save() can create the file
string file1 = Path.GetTempFileName();
File.Delete(file1);

CultureInfo ci = new CultureInfo("ar-EG");
RegionInfo ri = new RegionInfo("de-DE");

CultureAndRegionInfoBuilder carib =
    new CultureAndRegionInfoBuilder("x-en-US-Pepsi", CultureAndRegionModifiers.None);
carib.LoadDataFromCultureInfo(ci);
carib.LoadDataFromRegionInfo(ri);

// Round-trip through LDML, then register the reloaded culture
carib.Save(file1);
carib = CultureAndRegionInfoBuilder.CreateFromLdml(file1);
carib.Register();
When Windows knows more than .NET
As of XP SP2, there are 25 new locales in Windows:
  Bengali - India
  Croatian - Bosnia and Herzegovina
  Bosnian - Bosnia and Herzegovina
  Serbian - Bosnia and Herzegovina (Latin)
  Serbian - Bosnia and Herzegovina (Cyrillic)
  Welsh - United Kingdom
  Maori - New Zealand
  Malayalam - India
  Maltese - Malta
  Quechua - Bolivia
  Quechua - Ecuador
  Quechua - Peru
  Setswana / Tswana - South Africa
  isiXhosa / Xhosa - South Africa
  isiZulu / Zulu - South Africa
  Sesotho sa Leboa / Northern Sotho - South Africa
  Northern Sami - Norway
  Northern Sami - Sweden
  Northern Sami - Finland
  Lule Sami - Norway
  Lule Sami - Sweden
  Southern Sami - Norway
  Southern Sami - Sweden
  Skolt Sami - Finland
  Inari Sami - Finland
There will be more in future service packs
In Longhorn, there will be 75 or more new locales
Windows-only Cultures
The solution: Windows-only cultures!
The CLR synthesizes a CultureInfo object when Windows supports a locale that the .NET Framework does not know how to create itself
Windows only culture test
foreach (CultureInfo culture in CultureInfo.GetCultures(CultureTypes.WindowsOnlyCultures))
{
    Console.WriteLine(culture.Name);
}

// New cultures on XP SP2 include:
// mt-MT, bs-BA-Latn, smn-FI, smj-NO, smj-SE, sms-FI, sma-NO,
// sma-SE, quz-BO, quz-EC, quz-PE, ml-IN, bn-IN, cy-GB, and more
Special CultureInfo support for SQL Server 2005 (Yukon)
SQL Server locale semantics
  One setting for UI and formatting
  Another setting for collation/encoding
.NET/Windows semantics
  One setting for UI
  Another setting for formatting/collation
Solution
  Special GetCultureInfo overload that takes two CultureInfo names for the two SQL Server settings
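A minimal sketch of the two-name overload, CultureInfo.GetCultureInfo(name, altName). The first name supplies the culture's general data (formatting and so on); the second supplies the TextInfo and CompareInfo objects. The class name, the helper, and the choice of en-US with tr-TR are illustrative assumptions, and how SQL Server's own settings map onto the two parameters is not shown here.

```csharp
using System;
using System.Globalization;

public static class TwoNameCultureDemo
{
    // Hypothetical helper: wraps the two-name overload for readability.
    public static CultureInfo Create(string formattingName, string collationName)
    {
        return CultureInfo.GetCultureInfo(formattingName, collationName);
    }

    public static void Main()
    {
        CultureInfo ci = Create("en-US", "tr-TR");

        // The culture keeps the first name; the instance is cached and read-only.
        Console.WriteLine(ci.Name);
        Console.WriteLine(ci.IsReadOnly);

        // Casing and comparison objects come from the second culture.
        Console.WriteLine(ci.TextInfo.CultureName);
        Console.WriteLine(ci.CompareInfo.Name);

        // Formatting still follows the first culture.
        Console.WriteLine(1234.5.ToString("N2", ci));
    }
}
```

Because the returned object is read-only, code that needs to tweak formatting settings would have to Clone() it first.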
How Yukon uses this support
Microsoft.ReportingServices.Diagnostics.Localization
CatalogCulture
ClientPrimaryCulture
DefaultReportServerCulture
FallbackUICulture
InstalledCultureNames
ReportParameterCulture
SqlCulture
New locale properties/methods
TextInfo
  CultureName
  LCID
CompareInfo
  Name
DateTimeFormatInfo
  ShortestDayNames
  MonthGenitiveNames
  AbbreviatedMonthGenitiveNames
NumberFormatInfo
  NativeDigits
  DigitSubstitution
CultureInfo
  IsCustomCulture
  IetfLanguageTag
  CultureTypes
  GetCultureInfo()
  GetCultureInfoByIetfLanguageTag()
RegionInfo
  GeoId
  NativeName
  CurrencyEnglishName
  (Can now create via full culture names)
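A short sketch that touches several of the members listed above. The culture names (ru-RU, ar-SA, en-US) and the class and helper names are chosen only for illustration, and the strings printed depend on the culture data available on the machine, so no output is shown.

```csharp
using System;
using System.Globalization;

public static class NewLocaleMembersDemo
{
    // Hypothetical helper: genitive month names for a named culture.
    public static string[] GenitiveMonths(string cultureName)
    {
        return CultureInfo.GetCultureInfo(cultureName).DateTimeFormat.MonthGenitiveNames;
    }

    // Hypothetical helper: native digit strings for a named culture.
    public static string[] NativeDigits(string cultureName)
    {
        return CultureInfo.GetCultureInfo(cultureName).NumberFormat.NativeDigits;
    }

    public static void Main()
    {
        CultureInfo ru = CultureInfo.GetCultureInfo("ru-RU");
        DateTimeFormatInfo dtf = ru.DateTimeFormat;

        // Nominative vs. genitive month name, plus the shortest day names.
        Console.WriteLine("{0} / {1}", dtf.MonthNames[0], GenitiveMonths("ru-RU")[0]);
        Console.WriteLine(string.Join(" ", dtf.ShortestDayNames));

        // TextInfo and CompareInfo now report which culture they belong to.
        Console.WriteLine("{0} / {1}", ru.TextInfo.CultureName, ru.CompareInfo.Name);

        // Native digits and the digit substitution setting.
        NumberFormatInfo nfi = CultureInfo.GetCultureInfo("ar-SA").NumberFormat;
        Console.WriteLine("{0} ({1})", string.Join("", NativeDigits("ar-SA")), nfi.DigitSubstitution);

        // Lookup by IETF language tag.
        CultureInfo byTag = CultureInfo.GetCultureInfoByIetfLanguageTag("en-US");
        Console.WriteLine(byTag.IetfLanguageTag);

        // RegionInfo created from a full culture name.
        RegionInfo region = new RegionInfo("en-US");
        Console.WriteLine("{0}, {1}, {2}", region.GeoId, region.NativeName, region.CurrencyEnglishName);
    }
}
```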
Updates to encodings
Now built into the BCL
  Improved performance
  More flexibility
  Consistent results across supported platforms
Encoding enumeration API
UTF-32 support (little endian and big endian)
UTF-16 big endian support
Encoding/decoding fallbacks
  Exception
  Replacement
  “Best fit”
  Custom
using System;
using System.Text;

// Custom encoder fallback: replaces unencodable characters with
// decimal numeric character references such as &#8364;
public class NumericEntitiesFallback : EncoderFallback
{
    public override EncoderFallbackBuffer CreateFallbackBuffer()
    {
        return new NEFallbackBuffer();
    }

    // Longest replacement for a single char: "&#65535;"
    public override int MaxCharCount
    {
        get { return 8; }
    }
}

public class NEFallbackBuffer : EncoderFallbackBuffer
{
    // Store our default string
    private String strEntity;
    int fallbackCount = -1;
    int fallbackIndex = 0;

    // Fallback Methods
    public override bool Fallback(char charUnknown, int index)
    {
        // If we had a buffer already we're being recursive, throw,
        // it's probably at the suspect character in our array.
        if (fallbackCount >= 0)
            ThrowLastCharRecursive(unchecked((int)charUnknown));

        // Go ahead and get our fallback
        strEntity = String.Format("&#{0};", (int)charUnknown);
        fallbackCount = strEntity.Length;
        fallbackIndex = 0;
        return fallbackCount != 0;
    }

    public override bool Fallback(char charUnknownHigh, char charUnknownLow, int index)
    {
        // Double check input surrogate pair
        if (!Char.IsHighSurrogate(charUnknownHigh))
            throw new ArgumentOutOfRangeException("charUnknownHigh",
                "supposed to be between 0xD800 and 0xDBFF");
        if (!Char.IsLowSurrogate(charUnknownLow))
            throw new ArgumentOutOfRangeException("charUnknownLow",
                "supposed to be between 0xDC00 and 0xDFFF");

        // If we had a buffer already we're being recursive, throw, it's
        // probably at the suspect character in our array.
        if (fallbackCount >= 0)
            ThrowLastCharRecursive(Char.ConvertToUtf32(charUnknownHigh, charUnknownLow));

        // Go ahead and get our fallback
        strEntity = String.Format("&#{0};",
            Char.ConvertToUtf32(charUnknownHigh, charUnknownLow));
        fallbackCount = strEntity.Length;
        fallbackIndex = 0;
        return fallbackCount != 0;
    }

    public override char GetNextChar()
    {
        // We want it to get < 0 because == 0 means that the current/last
        // character is a fallback and we need to detect recursion. We
        // could have a flag but we already have this counter.
        fallbackCount--;

        // Do we have anything left? 0 is now last fallback char, negative
        // is nothing left
        if (fallbackCount < 0)
            return (char)0;

        // Need to get it out of the buffer.
        return strEntity[fallbackIndex++];
    }

    public override bool MovePrevious()
    {
        fallbackCount++;
        fallbackIndex--;
        return true;
    }

    public override int Remaining
    {
        get { return (fallbackCount < 0) ? 0 : fallbackCount; }
    }

    // private helper methods
    private void ThrowLastCharRecursive(int charRecursive)
    {
        // Throw it, using our complete character
        throw new ArgumentException(
            String.Format("Last character \\u{0:X4} was a recursive fallback", charRecursive),
            "chars");
    }
}
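For comparison with the custom fallback above, here is a small sketch of the two simplest built-in strategies, replacement and exception, selected through Encoding.GetEncoding(name, encoderFallback, decoderFallback). The class and method names and the sample text are illustrative assumptions.

```csharp
using System;
using System.Text;

public static class FallbackDemo
{
    // Encodes to ASCII, substituting 'replacement' for anything unencodable,
    // then decodes again so the effect is visible as a string.
    public static string EncodeWithReplacement(string text, string replacement)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderReplacementFallback(replacement),
            new DecoderReplacementFallback("?"));
        return ascii.GetString(ascii.GetBytes(text));
    }

    // Returns true when the exception fallback rejects the text.
    public static bool ThrowsOnUnencodable(string text)
    {
        Encoding ascii = Encoding.GetEncoding(
            "us-ascii",
            new EncoderExceptionFallback(),
            new DecoderExceptionFallback());
        try
        {
            ascii.GetBytes(text);
            return false;
        }
        catch (EncoderFallbackException)
        {
            return true;
        }
    }

    public static void Main()
    {
        string text = "caf\u00E9";                               // "café"
        Console.WriteLine(EncodeWithReplacement(text, "?"));     // caf?
        Console.WriteLine(ThrowsOnUnencodable(text));            // True
        Console.WriteLine(ThrowsOnUnencodable("cafe"));          // False
    }
}
```

A custom fallback such as NumericEntitiesFallback would be passed in the same position as EncoderReplacementFallback.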
Collation Improvements
OrdinalIgnoreCase
  Same results as ToUpper/Ordinal
  Matches OS file system results
Correct Serbian collation
  Fixed in Windows XP SP2
  Customer reported (MSDN Feedback Center)
Better handling of ignored/ignorable characters
  IndexOf/LastIndexOf/IsPrefix/IsSuffix
  StartsWith/EndsWith, too
OrdinalIgnoreCase sample
string strTest1 = "IamAString";
string strTest2 = "STRING";

if (strTest1.EndsWith(strTest2, StringComparison.OrdinalIgnoreCase))
{
    Console.WriteLine("Successful test!");
}
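The "same results as ToUpper/Ordinal" point can be checked with a small sketch. It assumes the culture-independent ToUpperInvariant is the relevant uppercase operation; the class name, the helper, and the sample strings are illustrative.

```csharp
using System;

public static class OrdinalIgnoreCaseDemo
{
    // True when OrdinalIgnoreCase agrees with "uppercase both sides, then compare ordinally".
    public static bool AgreesWithUpperOrdinal(string a, string b)
    {
        bool viaComparison = string.Equals(a, b, StringComparison.OrdinalIgnoreCase);
        bool viaUpper = string.Equals(
            a.ToUpperInvariant(), b.ToUpperInvariant(), StringComparison.Ordinal);
        return viaComparison == viaUpper;
    }

    public static void Main()
    {
        string[][] pairs =
        {
            new[] { "file", "FILE" },
            new[] { "IamAString", "iamastring" },
            new[] { "stra\u00DFe", "STRASSE" }   // no linguistic expansion of sharp s
        };

        foreach (string[] pair in pairs)
        {
            Console.WriteLine("{0} vs {1}: equal={2}, agrees={3}",
                pair[0], pair[1],
                string.Equals(pair[0], pair[1], StringComparison.OrdinalIgnoreCase),
                AgreesWithUpperOrdinal(pair[0], pair[1]));
        }
    }
}
```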
Unicode normalization
Described in UAX#15 at http://www.unicode.org/reports/tr15/
String.IsNormalized()
String.IsNormalized(NormalizationForm normalizationForm)
String.Normalize()
String.Normalize(NormalizationForm normalizationForm)
NormalizationForm enumeration
  FormC, FormD, FormKC, FormKD

Example: õĥµ¨ (U+00f5 U+0068 U+0302 U+00b5 U+00a8)
  LATIN SMALL LETTER O WITH TILDE; LATIN SMALL LETTER H; COMBINING CIRCUMFLEX ACCENT; MICRO SIGN; DIAERESIS
  FormC:  õĥµ¨   (U+00f5 U+0125 U+00b5 U+00a8)
  FormD:  õĥµ¨   (U+006f U+0303 U+0068 U+0302 U+00b5 U+00a8)
  FormKC: õĥμ ̈  (U+00f5 U+0125 U+03bc U+0020 U+0308)
  FormKD: õĥμ ̈  (U+006f U+0303 U+0068 U+0302 U+03bc U+0020 U+0308)
In collation, õĥµ¨ ≅ õĥµ¨ ≅ õĥμ ̈ ≅ õĥμ ̈
namespace àáâãäå
{
    using System;
    using System.Text;
    using System.Globalization;

    class àáâãäå
    {
        [STAThread]
        static void Main(string[] args)
        {
            àáâãäå();
            àáâãäå();
            àáâãäå();
            àáâãäå();
            àáâãäå();
            àáâãäå();
            àáâãäå();
        }

        static void àáâãäå(string àáâãäå)
        {
            StringBuilder àáâãäå = new StringBuilder();
            StringInfo àáâãäå = new StringInfo(àáâãäå);

            àáâãäå.Append(àáâãäå.Normalize(NormalizationForm.FormC));
            àáâãäå.Append(": ");

            for (int àáâãäå = 0; àáâãäå < àáâãäå.LengthInTextElements; àáâãäå++)
            {
                string àáâãäå = àáâãäå.SubstringByTextElements(àáâãäå, 1);
                if (àáâãäå.IsNormalized(NormalizationForm.FormC))
                {
                    àáâãäå.Append("C");
                }
                else if (àáâãäå.IsNormalized(NormalizationForm.FormD))
                {
                    àáâãäå.Append("D");
                }
                else
                {
                    àáâãäå.Append("_");
                }
            }

            Console.WriteLine(àáâãäå.ToString());
            return;
        }

        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
        static void àáâãäå() { àáâãäå.àáâãäå("àáâãäå"); }
    }
}
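The four normalization forms of the sample string from the normalization slide can be reproduced with a short sketch. The class and helper names are illustrative assumptions; the input is the five-code-point sequence U+00F5 U+0068 U+0302 U+00B5 U+00A8.

```csharp
using System;
using System.Text;

public static class NormalizationDemo
{
    // Formats each UTF-16 code unit as U+XXXX (the sample is entirely in the BMP).
    public static string CodePoints(string s)
    {
        StringBuilder sb = new StringBuilder();
        foreach (char c in s)
        {
            if (sb.Length > 0) sb.Append(' ');
            sb.AppendFormat("U+{0:X4}", (int)c);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        string source = "\u00F5\u0068\u0302\u00B5\u00A8";

        NormalizationForm[] forms =
        {
            NormalizationForm.FormC, NormalizationForm.FormD,
            NormalizationForm.FormKC, NormalizationForm.FormKD
        };

        foreach (NormalizationForm form in forms)
        {
            string normalized = source.Normalize(form);
            Console.WriteLine("{0,-6} {1}", form, CodePoints(normalized));
        }

        // The source mixes a precomposed and a decomposed letter,
        // so it is in neither FormC nor FormD.
        Console.WriteLine(source.IsNormalized(NormalizationForm.FormC)); // False
        Console.WriteLine(source.IsNormalized(NormalizationForm.FormD)); // False
    }
}
```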
IDN Mapping APIs
IdnMapping class
Based on three RFCs (standard based on Unicode 3.2)
  3490 - Internationalizing Domain Names in Applications (IDNA)
  3491 - Nameprep: A Stringprep Profile for Internationalized Domain Names (IDN)
  3492 - Punycode: A Bootstring encoding of Unicode for Internationalized Domain Names in Applications (IDNA)
Example:
\u5B89\u5BA4\u5948\u7F8E\u6075-with-SUPER-MONKEYS
becomes
xn---with-SUPER-MONKEYS-pc58ag80a8qai00g7n9n
Properties
  AllowUnassigned (allows new Unicode characters)
  UseStd3AsciiRules (more like DNS rules)
Methods
  GetAscii - Gets ASCII (Punycode) version of the string
  GetUnicode - Gets Unicode version of the string, normalized and limited to IDNA characters
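A minimal round trip through GetAscii and GetUnicode. The domain name bücher.example and the class and helper names are illustrative assumptions, not part of the original slides.

```csharp
using System;
using System.Globalization;

public static class IdnDemo
{
    // Converts a Unicode domain name to its ASCII (Punycode) form.
    public static string ToAscii(string unicodeName)
    {
        IdnMapping idn = new IdnMapping();
        return idn.GetAscii(unicodeName);
    }

    // Converts an ASCII (Punycode) domain name back to Unicode.
    public static string ToUnicode(string asciiName)
    {
        IdnMapping idn = new IdnMapping();
        return idn.GetUnicode(asciiName);
    }

    public static void Main()
    {
        string original = "b\u00FCcher.example";   // "bücher.example"
        string ascii = ToAscii(original);

        Console.WriteLine(ascii);                        // xn--bcher-kva.example
        Console.WriteLine(ToUnicode(ascii) == original); // True
    }
}
```

Setting UseStd3AsciiRules to true before calling GetAscii would additionally reject labels containing characters that STD3 host names do not permit.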
Unicode property information
New CharUnicodeInfo class
Extends methods on Char
Official data from the Unicode Character Database at http://www.unicode.org/ucd/
  IsWhiteSpace
  GetNumericValue
  GetDigitValue
  GetDecimalDigitValue
  GetUnicodeCategory
  GetBidiCategory
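A brief sketch of the numeric and category lookups on CharUnicodeInfo, using only the methods known to be public in the shipped class (GetUnicodeCategory, GetNumericValue, GetDigitValue, GetDecimalDigitValue). The class name, the helper, and the sample characters are illustrative choices.

```csharp
using System;
using System.Globalization;

public static class CharUnicodeInfoDemo
{
    // Hypothetical helper: one line of property data for a character.
    public static string Describe(char c)
    {
        return string.Format(
            "U+{0:X4}: category={1}, numeric={2}, digit={3}, decimal={4}",
            (int)c,
            CharUnicodeInfo.GetUnicodeCategory(c),
            CharUnicodeInfo.GetNumericValue(c),
            CharUnicodeInfo.GetDigitValue(c),
            CharUnicodeInfo.GetDecimalDigitValue(c));
    }

    public static void Main()
    {
        Console.WriteLine(Describe('a'));        // a letter: the numeric lookups return -1
        Console.WriteLine(Describe('\u0663'));   // ARABIC-INDIC DIGIT THREE: a decimal digit
        Console.WriteLine(Describe('\u00B2'));   // SUPERSCRIPT TWO: a digit, but not a decimal digit
        Console.WriteLine(Describe('\u00BD'));   // VULGAR FRACTION ONE HALF: numeric only
    }
}
```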
New text element support in the StringInfo class
StringInfo ctor that takes a string
StringInfo.String
StringInfo.LengthInTextElements
StringInfo.SubstringByTextElements()
Both use ParseCombiningCharacters() to get their results
New StringInfo props/methods sample
StringInfo si = new StringInfo("A\u0300\u0301\u0300e\u0300\u0301\u0300");

Console.WriteLine(si.LengthInTextElements); // Length is two

for (int ich = 0; ich < si.LengthInTextElements; ich++)
{
    Console.WriteLine(si.SubstringByTextElements(ich, 1));
}
New supplementary character support in lots of methods
New signature -- (String s, int index)
  IsControl, IsDigit, IsLetter, IsLetterOrDigit, IsLower, IsNumber, IsPunctuation, IsSeparator, IsSurrogate, IsSymbol, IsUpper, IsWhiteSpace, GetUnicodeCategory, GetNumericValue, IsHighSurrogate, IsLowSurrogate, IsSurrogatePair
ConvertToUtf32, ConvertFromUtf32 methods
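The (String s, int index) overloads can be exercised with a supplementary character. This sketch uses U+10400 (DESERET CAPITAL LETTER LONG I) as an arbitrary example; the class and helper names are illustrative assumptions.

```csharp
using System;
using System.Globalization;

public static class SupplementaryDemo
{
    // Builds a one-code-point string; supplementary code points become a surrogate pair.
    public static string FromCodePoint(int codePoint)
    {
        return char.ConvertFromUtf32(codePoint);
    }

    public static void Main()
    {
        string s = FromCodePoint(0x10400);

        Console.WriteLine(s.Length);                       // 2 (two UTF-16 code units)
        Console.WriteLine(char.IsSurrogatePair(s, 0));     // True
        Console.WriteLine(char.IsHighSurrogate(s, 0));     // True
        Console.WriteLine(char.IsLowSurrogate(s, 1));      // True

        // The string/index overloads look at the whole code point, not one code unit.
        Console.WriteLine(char.IsLetter(s, 0));            // True
        Console.WriteLine(char.GetUnicodeCategory(s, 0));  // UppercaseLetter

        // The single-char overload only sees half of the pair.
        Console.WriteLine(char.IsLetter(s[0]));            // False

        Console.WriteLine("U+{0:X}", char.ConvertToUtf32(s, 0)); // U+10400
    }
}
```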
References
MSDN Magazine article
  “Make the .NET World a Friendlier Place with the Many Faces of the CultureInfo Class”, March 2005
  http://msdn.microsoft.com/msdnmag/issues/05/03/CultureInfo/
SQL Server Books Online
  “International Considerations for SQL Server”
  http://whidbey.msdn.microsoft.com/library/en-us/icsql9/html/50dc4fa8-4772-46a8-a8ef-bc134502b4e0.asp
My blog
  http://blogs.msdn.com/michkap
Some other blogs for int’l support in Whidbey
  http://blogs.msdn.com/AchimR
  http://www.dasblonde.net/
  http://blogs.msdn.com/BCLTeam
Other useful sites
  http://www.microsoft.com/globaldev/
  http://lab.msdn.microsoft.com/productfeedback/
  http://www.unicode.org/
Globalization Features in Whidbey’s CLR
Questions