this post was submitted on 11 Aug 2022
General Programming Discussion
Python strings are UTF-8 encoded by default. UTF-8 is a variable-width encoding: a character takes between one and four bytes.
A decoder first checks the leading bit of the first byte. If it is `0`, the byte is a single-byte ASCII character. Two-byte characters always start with `110`, and the second byte starts with `10`. A three-byte character starts with `1110`, and the following bytes start with `10` again. The longest form, four bytes, starts with `11110`, and the following three bytes again start with `10`.
The Wikipedia page explains and visualizes it quite nicely.
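For anyone who wants to poke at the prefix rules, here is a minimal sketch in Python. The helper names `utf8_sequence_length` and `is_continuation` are made up for illustration, and the sketch only inspects bit prefixes; it does not reject overlong encodings or other invalid sequences the way a real decoder must.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte.

    Hypothetical helper: it checks only the leading bit pattern.
    """
    if first_byte >> 7 == 0b0:        # 0xxxxxxx: single-byte ASCII
        return 1
    if first_byte >> 5 == 0b110:      # 110xxxxx: two-byte sequence
        return 2
    if first_byte >> 4 == 0b1110:     # 1110xxxx: three-byte sequence
        return 3
    if first_byte >> 3 == 0b11110:    # 11110xxx: four-byte sequence
        return 4
    raise ValueError(f"{first_byte:#04x} cannot start a UTF-8 sequence")


def is_continuation(byte: int) -> bool:
    """True if the byte has the 10xxxxxx continuation prefix."""
    return byte >> 6 == 0b10


if __name__ == "__main__":
    for ch in ("A", "é", "€", "😀"):
        encoded = ch.encode("utf-8")
        bits = " ".join(f"{b:08b}" for b in encoded)
        print(ch, utf8_sequence_length(encoded[0]), bits)
```

Running it prints each character with its sequence length and raw bits, so the `0`, `110`, `1110`, `11110` and `10` prefixes are visible directly.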
If they used UTF-8 internally, they wouldn't need 4 versions of the split function.
https://github.com/python/cpython/blob/1402d2ceca8ccef8c3538906b3f547365891d391/Objects/unicodeobject.c#L9757
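A rough way to see the fixed-width internal storage from Python itself, assuming CPython 3.3 or later (where PEP 393 applies): the sketch below estimates how many bytes the interpreter stores per character by comparing `sys.getsizeof` for two lengths of the same repeated character. `bytes_per_char` is a hypothetical helper, and `sys.getsizeof` results are an implementation detail, so other interpreters may behave differently.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate the per-character storage of a str made only of `ch`.

    Hypothetical helper: subtracting the sizes of two strings of
    different lengths cancels the fixed object header.
    """
    short = ch * n
    long = ch * (2 * n)
    return (sys.getsizeof(long) - sys.getsizeof(short)) // n


if __name__ == "__main__":
    for ch in ("a", "é", "€", "😀"):
        # Compare internal width with the UTF-8 encoded length.
        print(ch, bytes_per_char(ch), len(ch.encode("utf-8")))
```

On CPython the internal width is fixed per string (1, 2 or 4 bytes per character, depending on the widest character in it), while the UTF-8 length varies per character. That difference is consistent with the linked code dispatching to separate implementations by storage kind.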