#facebook 🤦‍♂️ When you dump your data prior to deleting your account, you get a nice-looking ZIP with JSON, JPG, MP4 etc.
I've started importing that into Elasticsearch so that all the valuable 10-year-old arguments aren't lost etc etc.
This is a fragment of text in Polish taken from the dump (JSON-encoded Unicode):
nios\u00c4\u0085 za sob\u00c4\u0085
Turns out Facebook screwed up character encodings in their internal representation 🤦‍♂️
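To see the symptom for yourself, here is a minimal sketch that feeds the fragment above to Python's standard `json` module. The variable names are mine, and the fragment is wrapped in quotes so it parses as a JSON string; nothing else is assumed about the dump.

```python
import json

# The fragment exactly as it appears in the dump, quoted to form a JSON string.
raw = r'"nios\u00c4\u0085 za sob\u00c4\u0085"'

# A standards-compliant parser treats every \uXXXX escape as one code point.
parsed = json.loads(raw)

# The two escapes that should have been a single letter.
codes = [f"U+{ord(ch):04X}" for ch in parsed[4:6]]

print(repr(parsed))
print(codes)
print(len(parsed))  # the intended Polish text has 13 characters
```

The parser is doing exactly what the JSON spec tells it to, so the garbage is in the data, not in the decoder.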
If you're confused too, don't be like #facebook - watch my video from OWASP AppSec 2018 where I went into detail on how code points and encodings differ:
https://scitech.video/videos/watch/38bc6082-c97a-4422-bbb5-5a96d94f8603
What they dump is the *binary* representation of the UTF-8 *encoded* Unicode character U+0105 (LATIN SMALL LETTER A WITH OGONEK), i.e. the bytes C4 85, but each byte carries a \u prefix that confuses the decoder into thinking it's *two* Unicode characters with code points U+00C4 (LATIN CAPITAL LETTER A WITH DIAERESIS) and U+0085 (NEXT LINE).
They totally confused the concepts of a Unicode *code point* (the number of a character in the Unicode catalogue) and a character *encoding* (one of the possible binary representations of that character).
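If the diagnosis above is right, the damage is reversible: turn each bogus code point back into the byte it came from (Latin-1 maps U+0000..U+00FF one-to-one onto bytes 00..FF), then decode those bytes as UTF-8. Below is a rough sketch of that repair. The `fix_mojibake` helper and the `content` key are hypothetical names of mine, and it assumes every string in the dump is broken the same way, which I have not verified for the whole export.

```python
import json


def fix_mojibake(value):
    """Recursively repair strings whose UTF-8 bytes were escaped as code points."""
    if isinstance(value, str):
        try:
            # Code points U+00..U+FF -> raw bytes -> proper UTF-8 decoding.
            return value.encode("latin-1").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            # Does not match the broken pattern; leave it untouched.
            return value
    if isinstance(value, list):
        return [fix_mojibake(v) for v in value]
    if isinstance(value, dict):
        return {fix_mojibake(k): fix_mojibake(v) for k, v in value.items()}
    return value  # numbers, booleans, None


# "content" is a made-up key used only for illustration.
sample = json.loads(r'{"content": "nios\u00c4\u0085 za sob\u00c4\u0085"}')
fixed = fix_mojibake(sample)
print(fixed["content"])  # niosą za sobą
```

Running the parsed JSON through something like this before indexing should keep the mangled strings out of the search index; strings that fail the round trip are passed through unchanged rather than guessed at.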