Fix title() and capitalize() #7717

Merged
youknowone merged 4 commits into RustPython:main from joshuamegnauth54:titlecase-fixes on May 9, 2026
@joshuamegnauth54
Contributor

@joshuamegnauth54 joshuamegnauth54 commented Apr 28, 2026

I fixed title() and capitalize() and enabled their respective tests. Both fixes follow CPython's logic closely, including a workaround for sigma.
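For readers unfamiliar with the sigma workaround, here is a minimal std-only sketch of the Final_Sigma rule that CPython applies when lowercasing `Σ`: it becomes `ς` only when preceded by a cased letter and not followed by one, skipping case-ignorable characters on both sides. The function names and the two property helpers are hypothetical and deliberately rough approximations; this is not the code from the PR, which uses ICU property data.

```rust
/// Rough approximation of the Unicode `Cased` property (assumption:
/// titlecase letters such as U+01C5 are ignored here).
fn is_cased(c: char) -> bool {
    c.is_uppercase() || c.is_lowercase()
}

/// Very small stand-in for `Case_Ignorable` (assumption: only a few
/// punctuation marks and the basic combining marks are covered).
fn is_case_ignorable(c: char) -> bool {
    matches!(c, '\'' | '.' | ':' | '\u{00AD}' | '\u{2019}' | '\u{0300}'..='\u{036F}')
}

/// Chooses the lowercase form for the capital sigma at `chars[i]`.
fn lower_sigma(chars: &[char], i: usize) -> char {
    // Look backwards for the nearest character that is not case-ignorable.
    let before = chars[..i].iter().rev().find(|&&c| !is_case_ignorable(c));
    if !before.is_some_and(|&c| is_cased(c)) {
        return '\u{03C3}'; // σ: nothing cased precedes the sigma
    }
    // Look forwards in the same way.
    let after = chars[i + 1..].iter().find(|&&c| !is_case_ignorable(c));
    if after.is_some_and(|&c| is_cased(c)) {
        '\u{03C3}' // σ: the word continues
    } else {
        '\u{03C2}' // ς: final position
    }
}

/// Lowercases `s` one character at a time, special-casing U+03A3 (Σ).
fn lowercase_with_sigma(s: &str) -> String {
    let chars: Vec<char> = s.chars().collect();
    let mut out = String::with_capacity(s.len());
    for (i, &c) in chars.iter().enumerate() {
        if c == '\u{03A3}' {
            out.push(lower_sigma(&chars, i));
        } else {
            out.extend(c.to_lowercase());
        }
    }
    out
}
```

The per-character loop is the reason the special case is needed at all: a single `char` lowercasing has no view of its neighbours, so the context check has to be done by the caller.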

Besides that, I removed an unneeded dependency (unicode-casing) by using icu4x instead. This helps with consistency and removes outdated Unicode tables from the final RustPython binary.

Summary by CodeRabbit

  • Bug Fixes

    • Improved Unicode string case behavior (capitalize, title, istitle), including correct handling of Greek sigma, Turkish dotted I, German ß, combining marks, emoji-leading text, and preservation of invalid bytes.
  • Chores

    • Switched to ICU-based Unicode libraries and added a formatting/helper dependency.
  • Refactor

    • Reworked titlecasing and ASCII-bytes logic for clarity and reuse.
  • Tests

    • Enabled and expanded Unicode title/capitalize tests covering multiple edge cases.
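
As a rough illustration of the `istitle` rule mentioned in the summary, the sketch below follows the loop CPython uses: an uppercase or titlecase letter may not follow a cased letter, a lowercase letter must follow one, and at least one cased letter must be present. `is_titlecase_letter` is a hypothetical stand-in that hard-codes four titlecase digraphs; it is an assumption for illustration, whereas the PR checks the Unicode general category through ICU.

```rust
/// Hypothetical stand-in for a General_Category = Lt check. Only the four
/// Latin digraph titlecase letters are listed; real code would query
/// Unicode property data instead.
fn is_titlecase_letter(c: char) -> bool {
    matches!(c, '\u{01C5}' | '\u{01C8}' | '\u{01CB}' | '\u{01F2}')
}

/// Returns true when `s` is titlecased in the sense of Python's `str.istitle`.
fn is_title(s: &str) -> bool {
    let mut cased = false; // saw at least one cased character
    let mut prev_cased = false; // previous character was cased
    for c in s.chars() {
        if c.is_uppercase() || is_titlecase_letter(c) {
            // A capital may only start a word.
            if prev_cased {
                return false;
            }
            prev_cased = true;
            cased = true;
        } else if c.is_lowercase() {
            // A lowercase letter may only continue a word.
            if !prev_cased {
                return false;
            }
            prev_cased = true;
            cased = true;
        } else {
            // Uncased characters end the current word.
            prev_cased = false;
        }
    }
    cased
}
```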

@coderabbitai
Contributor

coderabbitai Bot commented Apr 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Pro

Run ID: 80cb2aad-e0ff-4825-b958-db496724819d

📥 Commits

Reviewing files that changed from the base of the PR and between cbce58c and ce75f32.

⛔ Files ignored due to path filters (2)
  • Cargo.lock is excluded by !**/*.lock
  • Lib/test/test_str.py is excluded by !Lib/**
📒 Files selected for processing (6)
  • Cargo.toml
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/str.rs
  • crates/vm/src/bytes_inner.rs
  • crates/vm/src/utils.rs
  • extra_tests/snippets/builtin_str.py
✅ Files skipped from review due to trivial changes (1)
  • extra_tests/snippets/builtin_str.py
🚧 Files skipped from review as they are similar to previous changes (5)
  • crates/vm/src/utils.rs
  • Cargo.toml
  • crates/vm/Cargo.toml
  • crates/vm/src/builtins/str.rs
  • crates/vm/src/bytes_inner.rs

📝 Walkthrough

Walkthrough

Replaces unicode-casing with ICU crates and writeable; rewrites PyStr casing methods (capitalize, title, istitle) to use ICU titlecasing and Unicode property checks; extracts an ASCII title helper for bytes; adds a Vec-backed fmt::Write helper; enables/extends related string tests.
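
The Vec-backed `fmt::Write` helper mentioned above can be sketched as follows. The struct name and its tuple field come from the change summary; the body of `write_str` and the `demo` function are assumptions, written as the most direct way to satisfy the trait.

```rust
use std::fmt;

/// Wraps a byte vector so that `write!` can append UTF-8 text to it.
pub(crate) struct VecFmtWriter(pub Vec<u8>);

impl fmt::Write for VecFmtWriter {
    fn write_str(&mut self, s: &str) -> fmt::Result {
        // A `&str` is always valid UTF-8, so its bytes can be appended as-is.
        self.0.extend_from_slice(s.as_bytes());
        Ok(())
    }
}

/// Hypothetical usage: append formatted text after an arbitrary byte.
pub(crate) fn demo() -> Vec<u8> {
    use std::fmt::Write as _;
    let mut w = VecFmtWriter(vec![0xFF]);
    write!(w, "{}-{}", "Dz", 7).unwrap();
    w.0
}
```

A wrapper like this is useful when the output buffer may already hold bytes that are not valid UTF-8, since a `String` could not represent them.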

Changes

Unicode casing & string title/capitalize overhaul

| Layer | File(s) | Summary |
|---|---|---|
| Dependencies | Cargo.toml, crates/vm/Cargo.toml | Workspace and crates/vm manifests updated: removed unicode-casing, added icu_casemap, icu_locale, icu_normalizer, icu_properties, and writeable; reordered related entries. |
| Imports / Traits | crates/vm/src/builtins/str.rs | Updated casing-related imports to use ICU casemap/locale, Unicode property groups, and Writeable; removed prior unicode_casing imports. |
| Infrastructure | crates/vm/src/utils.rs | Added pub(crate) struct VecFmtWriter(pub Vec<u8>) and impl fmt::Write to allow writing UTF-8 text into a Vec<u8>. |
| Bytes Helpers | crates/vm/src/bytes_inner.rs | Introduced pub(crate) fn title_ascii(bytes: &[u8]) -> Vec<u8>; PyBytesInner::title now delegates to this ASCII-only helper. |
| Capitalize (UTF-8) | crates/vm/src/builtins/str.rs | PyStr::capitalize UTF-8 path uses titlecase_first (ICU via Writeable) then lowercase_or_sigma for remaining characters. |
| Capitalize (WTF-8) | crates/vm/src/builtins/str.rs | PyStr::capitalize WTF-8 path processes valid UTF-8 chunks with ICU, preserves invalid bytes between chunks. |
| Title dispatch | crates/vm/src/builtins/str.rs | PyStr::title dispatches: ASCII -> title_ascii, UTF-8 -> titlecase_string, WTF-8 -> chunked titlecase_string preserving invalid bytes. |
| istitle update | crates/vm/src/builtins/str.rs | PyStr::istitle now checks GeneralCategoryGroup::TitlecaseLetter (and uppercase) for titlecase eligibility. |
| Helpers | crates/vm/src/builtins/str.rs | Added titlecase_first, titlecase_string, lowercase_or_sigma, and handle_capital_sigma implementing ICU-backed title behavior and sigma selection. |
| Tests | extra_tests/snippets/builtin_str.py | Enabled "DZ".title() == "Dz" and added str.capitalize() assertions for "ßello", "İstanbul", "a\u0301bc", "ΣΙΓΜΑ", "😀hello", and "élan". |
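
To make the ASCII-only helper concrete, here is a minimal sketch with the signature listed in the summary. The body is an assumption modeled on how Python's `bytes.title()` behaves (the first ASCII letter of each run is uppercased, the rest lowercased, every other byte copied unchanged); the PR's actual implementation may differ in detail.

```rust
/// Titlecases ASCII letters in `bytes`, leaving every other byte untouched.
pub(crate) fn title_ascii(bytes: &[u8]) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut prev_cased = false; // was the previous byte an ASCII letter?
    for &b in bytes {
        if b.is_ascii_alphabetic() {
            out.push(if prev_cased {
                b.to_ascii_lowercase() // inside a word
            } else {
                b.to_ascii_uppercase() // first letter of a word
            });
            prev_cased = true;
        } else {
            // Digits, punctuation, and non-ASCII bytes end the word.
            out.push(b);
            prev_cased = false;
        }
    }
    out
}
```

Because non-ASCII bytes count as word breaks, an input such as `b"\xe9lan"` keeps its first byte and capitalizes the `l` that follows it.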

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant PyStr
  participant ICU
  participant Props
  participant Bytes
  Caller->>PyStr: call .capitalize() / .title() / .istitle()
  PyStr->>ICU: titlecase_string / titlecase_first (language-aware mapping)
  PyStr->>Props: query GeneralCategory / TitlecaseLetter (sigma/context checks)
  ICU-->>PyStr: mapped titlecase segments (Writeable output)
  PyStr->>PyStr: lowercase_or_sigma / handle_capital_sigma adjustments
  PyStr->>Bytes: title_ascii for ASCII-only fast path or bytes chunks
  Bytes-->>PyStr: ASCII-chunk results
  PyStr-->>Caller: return final cased string (WTF-8: preserve invalid bytes)
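
The last step of the diagram, preserving invalid bytes while casing the valid text around them, can be approximated with the standard library alone. This sketch assumes Rust 1.79 or later for `<[u8]>::utf8_chunks`; `map_valid_chunks` is a hypothetical name, and RustPython's own WTF-8 string type is not used here, so treat it as an analogy rather than the PR's code.

```rust
/// Applies `f` to every valid UTF-8 run in `bytes` and copies the invalid
/// bytes between runs through unchanged.
fn map_valid_chunks(bytes: &[u8], f: impl Fn(&str) -> String) -> Vec<u8> {
    let mut out = Vec::with_capacity(bytes.len());
    for chunk in bytes.utf8_chunks() {
        // `valid()` is the longest valid prefix of this chunk.
        out.extend_from_slice(f(chunk.valid()).as_bytes());
        // `invalid()` holds the bytes that broke decoding (possibly empty).
        out.extend_from_slice(chunk.invalid());
    }
    out
}
```

For example, passing `|s| s.to_uppercase()` over the bytes `a`, `ED A0 80` (an encoded lone surrogate), `b` yields `A`, the same three bytes, then `B`.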

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

  • ShaharNaveh
  • youknowone

Poem

A rabbit found a casing clue,
Swapped crates and stitched bytes true,
Sigma learned its contextual tune,
Vecs wrote letters by the moon,
Hooray — the strings now wake anew! 🐇✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title directly reflects the main changes: fixes to str.title() and str.capitalize() implementations with ICU-based titlecasing integration. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

@joshuamegnauth54 joshuamegnauth54 changed the title from "Titlecase fixes" to "Fix title() and capitalize()" May 2, 2026
@joshuamegnauth54 joshuamegnauth54 marked this pull request as ready for review May 2, 2026 00:53
@github-actions
Contributor

github-actions Bot commented May 2, 2026

📦 Library Dependencies

The following Lib/ modules were modified. Here are their dependencies:

[x] test: cpython/Lib/test/test_str.py (TODO: 8)
[x] test: cpython/Lib/test/test_fstring.py (TODO: 19)
[x] test: cpython/Lib/test/test_string_literals.py (TODO: 4)

dependencies:

dependent tests: (no tests depend on str)

Legend:

  • [+] path exists in CPython
  • [x] up-to-date, [ ] outdated

Contributor

@ShaharNaveh ShaharNaveh left a comment

Overall LGTM!

ty:)

Comment thread Lib/test/string_tests.py
@ShaharNaveh
Contributor

@joshuamegnauth54 can you please fix the merge conflicts?

@youknowone youknowone enabled auto-merge (squash) May 4, 2026 11:47
auto-merge was automatically disabled May 4, 2026 17:10

Head branch was pushed to by a user without write access

@joshuamegnauth54
Contributor Author

Fixed and also I fixed a small clippy lint that I introduced. I'm not sure why it didn't trigger earlier, but one of my helpers should have been pub(crate) instead of just pub.

@ShaharNaveh
Contributor

> Fixed and also I fixed a small clippy lint that I introduced. I'm not sure why it didn't trigger earlier, but one of my helpers should have been pub(crate) instead of just pub.

That's because we added a new lint for that in #7762 :)

Contributor

@ShaharNaveh ShaharNaveh left a comment

LGTM!
tysm:)

@youknowone youknowone enabled auto-merge (squash) May 6, 2026 11:39
@youknowone
Member

I'm sorry, I merged another PR first and it caused a conflict. Could you resolve the conflict? I will merge it immediately.

auto-merge was automatically disabled May 6, 2026 18:19

Head branch was pushed to by a user without write access

@joshuamegnauth54
Contributor Author

Fixed and force pushed. 😁

Contributor

@ShaharNaveh ShaharNaveh left a comment

lgtm!
ty:)

@youknowone youknowone enabled auto-merge (squash) May 9, 2026 15:32
`icu_casemap` is consistently maintained, official, and tracks the
latest Unicode versions. RustPython is also using other `icu4x` crates,
so using `icu_casemap` is more consistent.

As with islower and isupper, tracking the latest Unicode version is
important because character definitions shift over time, which causes
discrepancies between RustPython and CPython.

This commit fixes title().
I dropped unicode-casing because it's cleaner to use icu4x for
everything. `icu4x` will also stay up to date whereas unicode-casing
will need to be periodically updated with new Unicode tables. Dropping
unicode-casing also removes some binary bloat due to the tables.

`capitalize()` mimics CPython behavior more closely now as well.
Notably, I implemented CPython's sigma edge case handler.
@youknowone youknowone merged commit 108461f into RustPython:main May 9, 2026
26 checks passed
3 participants