An interesting question asked by @danthesqlman in #sqlhelp (sqlcommunity.slack.com)
Having issues with Unicode in my XML, tried using a try_convert(varchar,fieldname) but not returning NULL.
Set it to have a test on my box, and weird results.
declare @n nvarchar(10) = N’ניקודות’
select try_convert(varchar(10),@n)
This doesn’t return NULL, but ?????????
I’m curious what would I be doing wrong, or how can i locate unicode within XML easily
And then when people suggested individual character shredding –
XML documents in a table over 200k rows, 2mb xml each, could take hours to parse 1 character at a time
There were a few suggestions, (my initial crap one was just dumping it to C#), but after a few jokes back and forth about how SQL Server was just returning normal question marks for TRY_CONVERT and how silly that was the idea came up… why not just:
- Replace all question marks with something unique? (I suggested a GUID)
- Run the conversion and then do a reverse replace, updating the data in place.
- Profit!
For a simple code example…
DECLARE @magic_value UNIQUEIDENTIFIER = NEWID()
SELECT
TRY_CONVERT
(
VARCHAR(100),
REPLACE(tar.name,'?',@magic_value)
)
FROM target_table AS tar
Any new question marks that exist in the output would be characters that failed the conversion process.
The test ran in ten minutes instead of a few hours… great!
Also a fun followup on weird SQL Server homoglyph conversion issues in general – https://bertwagner.com/posts/how-unicode-homoglyphs-can-thwart-your-database-security/