Unicode storage of \u202b and \u202c in a Unicode-aware database?

I’m building a new product for toponyms, and in it the Arabic shows up roughly like this:

^IArabic^I<202b>ﺰﻤﺑﺎﺑﻮﻳ<202c>^I<202b>ﺞﻫﻭﺮﻳﺓ ﺰﻤﺑﺎﺑﻮﻳ<202c>$   

Actually, not quite. My ASCII-spewing terminal makes a real mess of this, so I’ll make an exception and post a screenshot of the text.

[Screenshot: Arabic]

My question is about the U+202B “Right-To-Left Embedding” and U+202C “Pop Directional Formatting” characters. Do those get stored as data? My first assumption was that they were an artifact of rendering and not actually in the file, but alas, they are there:

360    5E03 97E6 5171 548C 56FD 000A 0009 0041 0072 0061 0062 0069 0063 0009 202B 0632    布韦共和国..Arabic..ز
.............................................................................^HERE
389    0645 0628 0627 0628 0648 064A 202C 0009 202B 064F 062C 0647 0648 0631 064A 0629    مبابوي...ُجهورية
.....................................^HERE.....^HERE
422    0020 0632 0645 0628 0627 0628 0648 064A 202C 000A 0009 004E 006F 0074 0065 0073    .زمبابوي...Notes
...............................................^HERE
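To double-check that these really are code points in the extracted text and not something the terminal invents, a quick scan along these lines can be used. This is only a sketch in Python: `find_bidi_controls`, `BIDI_FORMAT_CLASSES` and `sample` are names made up for illustration, and the sample string is the first Arabic field copied from the dump. It looks only at the explicit embedding, override and isolate characters, so marks such as U+200E and U+200F are deliberately not covered.

```python
import unicodedata

# Bidi classes of the explicit directional formatting characters
# (embeddings, overrides, isolates and their terminators).
BIDI_FORMAT_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def find_bidi_controls(text):
    """Return (index, 'U+XXXX', name) for each explicit bidi formatting character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if unicodedata.bidirectional(ch) in BIDI_FORMAT_CLASSES
    ]


# First Arabic field from the dump: TAB "Arabic" TAB RLE <7 letters> PDF TAB
sample = "\tArabic\t\u202b\u0632\u0645\u0628\u0627\u0628\u0648\u064a\u202c\t"

for hit in find_bidi_controls(sample):
    print(hit)
# (8, 'U+202B', 'RIGHT-TO-LEFT EMBEDDING')
# (16, 'U+202C', 'POP DIRECTIONAL FORMATTING')
```

Both characters are reported with general category `Cf` (format) by `unicodedata.category`, which is what makes me suspect they are presentation hints and not part of the name itself.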

When storing Arabic in a database, do you typically store \u202b and \u202c? They look like formatting characters that only affect rendering, not data in the strict sense. I simply want to process this text and load it into a database, and I’m wondering whether these characters should be kept in the database or stripped before insert.
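If stripping turns out to be the right call, I assume it would look something like the sketch below. Whether it is the right call is exactly what I’m asking; the code only shows the mechanics. `strip_bidi_controls`, `BIDI_CONTROLS` and `row` are hypothetical names, the choice to also remove the override and isolate characters (U+202A to U+202E, U+2066 to U+2069) is my own assumption, and the row is rebuilt from the code points in the dump, including the U+064F that pdftotext emitted.

```python
import re

# U+202A..U+202E: LRE, RLE, PDF, LRO, RLO
# U+2066..U+2069: LRI, RLI, FSI, PDI
BIDI_CONTROLS = re.compile("[\u202a-\u202e\u2066-\u2069]")


def strip_bidi_controls(text):
    """Remove explicit bidi embedding/override/isolate characters from text."""
    return BIDI_CONTROLS.sub("", text)


# One tab-separated row, rebuilt from the code points in the dump.
row = (
    "\tArabic"
    "\t\u202b\u0632\u0645\u0628\u0627\u0628\u0648\u064a\u202c"
    "\t\u202b\u064f\u062c\u0647\u0648\u0631\u064a\u0629"
    " \u0632\u0645\u0628\u0627\u0628\u0648\u064a\u202c"
)

# Split into fields first, then clean each field before it goes to the database.
fields = [strip_bidi_controls(field) for field in row.split("\t")]

for field in fields:
    print([f"{ord(ch):04X}" for ch in field])
```

After this step none of the fields contains 202B or 202C, and the Arabic letters themselves are untouched.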

Background

  • The screenshot was taken with Vim in a terminal (Kitty), which cannot render the Arabic text properly because it places every character in a fixed grid cell.
  • The text comes from text extraction (using pdftotext).
  • The PDF was produced by the “United Nations Group of Experts on Geographical Names”. You can find the PDF (E/CONF.105/13) freely available here.