General Programming Discussion

How does Python 3.10's string work? (lemmy.ml)

submitted 2 years ago by [email protected] to c/[email protected]

I don't know how Python 3.10's strings work internally. Does it choose between 8-bit, 16-bit, and 32-bit storage per character at runtime?

For example:

for line in open('read1.py'):
    print(line)

Can the line string use 8-bit, 16-bit, or 32-bit characters depending on the iteration? Is a line 8-bit by default, becoming a 32-bit string if it contains an emoji?

[–] [email protected] 1 points 2 years ago

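If they used UTF-8 internally, they wouldn't need four versions of the split function. CPython instead picks a fixed width of 1, 2, or 4 bytes per character for each string, with a separate fast path for pure ASCII, and dispatches on it: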

        case PyUnicode_1BYTE_KIND:
            if (PyUnicode_IS_ASCII(self))
                return asciilib_split_whitespace(
                    self,  PyUnicode_1BYTE_DATA(self),
                    len1, maxcount
                    );
            else
                return ucs1lib_split_whitespace(
                    self,  PyUnicode_1BYTE_DATA(self),
                    len1, maxcount
                    );
        case PyUnicode_2BYTE_KIND:
            return ucs2lib_split_whitespace(
                self,  PyUnicode_2BYTE_DATA(self),
                len1, maxcount
                );
        case PyUnicode_4BYTE_KIND:
            return ucs4lib_split_whitespace(
                self,  PyUnicode_4BYTE_DATA(self),
                len1, maxcount
                );

https://github.com/python/cpython/blob/1402d2ceca8ccef8c3538906b3f547365891d391/Objects/unicodeobject.c#L9757
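
The per-string width can also be observed from Python itself. The following is a rough sketch, not an official API: `char_width` is a hypothetical helper name, and it relies on `sys.getsizeof`, whose results are a CPython implementation detail and may differ on other interpreters.

```python
import sys

def char_width(s: str) -> int:
    """Estimate the bytes CPython uses per character of a non-empty string s."""
    # Doubling the string adds len(s) characters of the same kind, so the
    # size difference divided by len(s) is the per-character width.
    return (sys.getsizeof(s * 2) - sys.getsizeof(s)) // len(s)

for sample in ["plain ascii", "café", "price: 5€", "hello 😀"]:
    print(repr(sample), char_width(sample))
```

On CPython 3.3 or later this should report widths of 1, 1, 2, and 4. The width is chosen for the whole string from its widest character, so a single emoji makes every character in that line take 4 bytes, while the other lines of the file stay at 1 byte per character.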