🤦‍♂️ When you dump your data prior to deleting your account, you get a nice-looking ZIP with JSON, JPG, MP4 etc.

I've started importing that into Elasticsearch so that all the valuable 10-year-old arguments aren't lost etc etc.

This is a fragment of text in Polish taken from the dump (JSON-encoded Unicode):

nios\u00c4\u0085 za sob\u00c4\u0085

Turns out Facebook screwed up character encodings in their internal representation 🤦‍♂️


What they dump is the *binary* representation of the UTF-8 *encoded* Unicode character U+0105 (LATIN SMALL LETTER A WITH OGONEK), i.e. the bytes C4 85, but each byte gets a \u prefix, which confuses the decoder into thinking it's *two* Unicode characters with code points U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) and U+0085 (NEXT LINE).
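A minimal Python sketch of what a JSON parser makes of that fragment, and one way to undo the damage. It assumes every string in the dump was mangled the same way; the helper name `fix_mojibake` is mine, not anything from the export.

```python
import json

def fix_mojibake(text: str) -> str:
    """Treat each code point (0-255) as a raw byte, then decode those bytes as UTF-8."""
    return text.encode("latin-1").decode("utf-8")

# The fragment exactly as it appears in the dump (a raw string keeps the \u escapes literal).
raw = r'"nios\u00c4\u0085 za sob\u00c4\u0085"'

parsed = json.loads(raw)       # the parser yields U+00C4 U+0085, not U+0105
fixed = fix_mojibake(parsed)   # bytes C4 85 decoded as UTF-8 give U+0105

print(ascii(parsed))  # 'nios\xc4\x85 za sob\xc4\x85'
print(ascii(fixed))   # 'nios\u0105 za sob\u0105'
```

A string that contains any code point above U+00FF was not mangled this way, and `encode("latin-1")` will raise `UnicodeEncodeError` on it, so real import code would probably want to catch that and keep the original string.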

They totally confused the concepts of a Unicode *code point* (the number of a character in the Unicode catalogue) and a character *encoding* (one of the possible binary representations of that character).
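To make the distinction concrete, here is a small Python illustration (variable names are just for this example): one character, one code point, and two different byte encodings of it.

```python
char = "\u0105"                          # LATIN SMALL LETTER A WITH OGONEK

code_point = ord(char)                   # its number in the Unicode catalogue
utf8_bytes = char.encode("utf-8")        # one possible binary representation
utf16_bytes = char.encode("utf-16-be")   # another one, for comparison

print(hex(code_point))     # 0x105
print(utf8_bytes.hex())    # c485
print(utf16_bytes.hex())   # 0105
```

The dump writes the UTF-8 bytes (c4, 85) where JSON's \u escapes expect code points.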
