
Thursday, July 18, 2024

Bidirectional Text (Part 2): Delving into Bidi -- Registration Now Open!


When: Tuesday, August 13, 2024 starting at 10:00 (San Francisco), 13:00 (New York), 17:00 (UTC) and 19:00 (Berlin).

About the Webinar

A number of scripts, such as Hebrew, Arabic and Urdu, write their letters horizontally on a page or screen, running right to left. A complication for these scripts is that other characters, such as digits, flow left-to-right, and can occur on the same line, or even alongside other left-to-right text, such as Latin. Text that contains both right-to-left and left-to-right content is called “bidirectional” text (“bidi” for short).
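
As a rough illustration of why such mixed text needs special handling, the sketch below uses Python's standard `unicodedata` module to print the Bidi_Class of a few characters. The sample string and the helper name `bidi_classes` are illustrative assumptions, not part of the webinar material.

```python
import unicodedata

def bidi_classes(text):
    """Return (character, Bidi_Class) pairs for each character in text."""
    return [(ch, unicodedata.bidirectional(ch)) for ch in text]

# Hebrew alef, Arabic alef, a European digit, an Arabic-Indic digit, a Latin letter.
sample = "\u05D0\u06271\u0661A"

for ch, cls in bidi_classes(sample):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {cls}")
```

Letters from right-to-left scripts report `R` or `AL`, digits report `EN` or `AN`, and Latin letters report `L`; a display engine has to reconcile these classes when they meet on one line.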

Part 2 of this three-part series will delve into more advanced topics and practical applications of bidirectional text processing. Three expert speakers will be on hand to provide detailed insights and answer your questions, helping you gain a comprehensive understanding of the subject.

Who Should Register?
  • Translators/Localizers
  • Localization Tooling makers
  • I18n infrastructure developers
  • Linguists and language researchers
  • Application developers
  • Content authors
Session Presenters

Adil Allawi has worked for over 40 years in multilingual engineering. One of his early projects was working on one of the first implementations of right to left text in a personal computer on the Apple II. As such he feels personally responsible for all the bidi problems that have happened since. Adil has been a regular contributor to Unicode, consulting on the definition of Bidi isolates and auto direction and the encoding of Arabic Mathematical Symbols. He currently works at Apple where he has contributed to right-to-left support in multiple products including iWork, App Store, Music and Apple TV.

Ayman Aldahleh began his career in the late 1980s at a small software company named 4C in Kuwait, where he focused on developing PC-DOS applications that supported the Arabic language. He soon moved to Microsoft, where he led the Arabization of several early Microsoft products, including DOS, Works, Windows, and Word. At Microsoft, Ayman’s role expanded to include support for bidirectional and complex script languages, text rendering, font management, and accessibility. He eventually managed the engineering team that scaled the internationalization platform for all Microsoft Office applications, enhancing multilingual and machine translation features. His final role at Microsoft involved overseeing cross-platform user experience for the Microsoft Fluent design. Ayman retired from Microsoft in late 2023 but remains an enthusiastic advocate for technology and internationalization. He has been a member of the Unicode Board of Directors since 2017. Ayman earned a Bachelor of Science in Computer Engineering from the University of Arizona.

Roozbeh Pournader is an internationalization engineer who has been contributing to the Unicode Standard since 1999. He started his internationalization career in Iran in 1994 when he was a high school student. After moving to the United States, he has worked at companies such as Google and WhatsApp. He has received a Unicode Bulldog award for his contributions to Unicode and CLDR’s support for complex scripts, and is Vice Chair of the Unicode Script Encoding Working Group.

Registration: Registration is Open Now! Please note this session will also be recorded and available via the Unicode YouTube channel.



Supporting Resources for Bidirectional Text (Part 2): Delving into Bidi

Bidirectional Text (Part 1): The Basics of Bidi Video Recording: https://youtu.be/TWfvRdS_7x0

Frequently Asked Questions: https://unicode.org/faq/bidi.html

About the Unicode Consortium

The Unicode Consortium is the premier non-profit open source, open standards body for the internationalization of all software and services.

For more than 30 years, the Unicode Consortium has coordinated the efforts of a worldwide team of volunteer programmers and linguists to standardize, evolve, and maintain a global software foundation that allows virtually every computer system and service to help people connect using their native language.

For additional information about Unicode, visit home.unicode.org.

Unicode Resources

Adopt a Character and Support Unicode’s Mission

Looking to give that special someone a special something?
Or maybe something to treat yourself?
🕊️💗🏎️🐨🔥🚀爱₿♜🍀

Adopt a character or emoji to give it the attention it deserves, while also supporting Unicode’s mission to ensure everyone can communicate in their own languages across all devices.

Each adoption includes a digital badge and certificate that you can proudly display!

Have fun and support a good cause

You can also donate funds or gift stock


As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.

Friday, May 31, 2024

New Event on June 25 - Webinar on Bidirectional Text (Part 1): The Basics of Bidi

Registration is Now Open!

A number of scripts, such as Hebrew, Arabic and Urdu, write their letters horizontally on a page or screen, running right to left. A complication for these scripts is that other characters, such as digits, flow left-to-right, and can occur on the same line, or even alongside other left-to-right text, such as Latin. Text that contains both right-to-left and left-to-right content is called “bidirectional” text (“bidi” for short).

How to handle bidi text on browsers and in other software is challenging for both general users and implementers. This webinar will describe the basics with examples. It will be followed by a live question-and-answer period. A more in-depth question and answer session will take place August 13, 2024.

Who? If you are a translator or localizer, localization tooling maker, I18n infrastructure developer, linguist or language researcher, application developer, or content author, you will want to join us for this webinar. Bring your questions to the people involved for the live Q&A.

When? Tuesday, 25 June 2024 starting at 8am (San Francisco), 11am (New York), and 5pm (Berlin).

Registration is Open Now! Please note this session will also be recorded and available via the Unicode YouTube channel.


Getting Started with Bidirectional Text (Part 1): The Basics of Bidi

Frequently Asked Questions: https://unicode.org/faq/bidi.html



Wednesday, September 13, 2023

Source Code Handling: Preventing Spoofing at the Source

By: Mark Davis, Cofounder and CTO

The Unicode Consortium is providing a new resource to help programming tooling developers, programming language developers, and programming language users to deal with Unicode spoofing.

Background

With Unicode encompassing letters and symbols (over 149,000 in Unicode 15.1) from across the world’s writing systems, it was inevitable that many of them would look similar, and sometimes identical. And of course, there are those who would take advantage of that to swindle. An example of this is “pаypal.com”, where the first ‘а’ is actually a Cyrillic character that is confusable with the Latin letter ‘a’. 😵‍💫

In 2004, the Unicode Consortium began working to address this issue, focusing on URLs and other identifiers that could be spoofed, and produced a specification and technical report with best practices for detecting such cases. Implementations using those specifications have been widely deployed in operating systems.
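
To make the spoofed domain above concrete, here is a minimal sketch of a mixed-script check. It is a heuristic that guesses each letter's script from the first word of its Unicode character name; it is not the detection mechanism defined in the Consortium's specifications, and the function names are hypothetical.

```python
import unicodedata

def scripts_of_letters(label):
    """Approximate each letter's script by the first word of its Unicode name."""
    scripts = set()
    for ch in label:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def looks_mixed_script(label):
    """True when the letters in label appear to come from more than one script."""
    return len(scripts_of_letters(label)) > 1

spoof = "p\u0430ypal.com"  # U+0430 CYRILLIC SMALL LETTER A in place of Latin "a"
genuine = "paypal.com"

print(scripts_of_letters(spoof), looks_mixed_script(spoof))
print(scripts_of_letters(genuine), looks_mixed_script(genuine))
```

A production check would rely on real Script property data and the confusables data rather than on character names, since legitimate identifiers can also mix scripts.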

In November of 2021, another class of problems was documented. It was demonstrated that malicious agents could write source code that looks secure to human reviewers but actually contains hidden traps. There are three main categories of these spoofs: line-break spoofs, confusable spoofs, and bidirectional ordering spoofs.

Examples

  • Line-break spoofs can cause what appears to be a line of code to be actually commented out, as far as the compiler is concerned. This can happen with C11, for example:
    [image precondition]
    To a reviewer, this is an active line of code. But when U+2028 Line Separator is at the end of the first line, the C11 compiler will interpret this as one line consisting only of a comment!

  • The “pаypal.com” above is an example of a confusable spoof.

  • As for a bidirectional spoof, take a pair of variables named Aא1 and A1א; these look identical, but the former consists of the letters A and א followed by the digit 1, whereas the latter consists of the letter A, the digit 1, and the letter א, in that order.
Such code might not even be malicious. It might just be very hard to understand, and it is all too easy to accidentally give reviewers (or even the writer!) the wrong impression, leading to hidden software bugs. Here’s an example:

The text “Error: {0} {1}", message” becomes RTL in translation.
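
The pair of look-alike variable names from the bidirectional spoof bullet can be inspected in logical (stored) order with a short sketch. The identifiers are built from escapes so that the code itself is not reordered on display; the helper name `describe` is an assumption for illustration.

```python
import unicodedata

ALEF = "\u05D0"  # HEBREW LETTER ALEF, a right-to-left letter

ident_1 = "A" + ALEF + "1"  # letter A, alef, digit 1
ident_2 = "A" + "1" + ALEF  # letter A, digit 1, alef

def describe(identifier):
    """List the code points of an identifier in logical (stored) order."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in identifier]

print(describe(ident_1))
print(describe(ident_2))
print(ident_1 == ident_2)
```

The two strings hold the same characters in a different stored order, so a compiler treats them as distinct names even when a bidi-aware display shows them alike.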

The earlier work on spoofing identifiers was relevant to this work, but did not explicitly deal with the environment surrounding software development. Moreover, the guidance was aimed at internationalization experts, not programming language and software tooling developers.

Process

In response to this problem, the Consortium started a project in early 2022 to put together a cross-functional group of experts in Unicode processing, programming languages, and software development tooling to address these problems. That project resulted in the Source Code Working Group (SCWG), which brought together a set of experts to work through the possible problems.

The first results of this group were a number of enhancements to core Unicode specifications in September of 2022. UAX #9 provided an extended example of use of the important higher-level protocol HL4, and emphasized the use to mitigate misleading bidirectional ordering of source code, including potential spoofing attacks; UAX #31 provided important guidance on profiles for default identifiers and clarified that requirement on Pattern_White_Space and Pattern_Syntax characters applies to programming languages, and is relevant to issues of bidirectional ordering and potential spoofing attacks.
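
As a hedged sketch of how tooling might surface the line-break spoof described earlier, the function below reports non-ASCII Pattern_White_Space characters (including U+2028 LINE SEPARATOR) in source text. The function name and the sample source are hypothetical, and a real tool would follow the full guidance in the specifications rather than this simple scan.

```python
# Non-ASCII members of Pattern_White_Space, with descriptive labels.
NON_ASCII_PATTERN_WHITE_SPACE = {
    "\u0085": "NEXT LINE (NEL)",
    "\u200E": "LEFT-TO-RIGHT MARK",
    "\u200F": "RIGHT-TO-LEFT MARK",
    "\u2028": "LINE SEPARATOR",
    "\u2029": "PARAGRAPH SEPARATOR",
}

def find_hidden_white_space(source):
    """Return (line, column, label) for each non-ASCII Pattern_White_Space character."""
    findings = []
    # Split on "\n" only: str.splitlines() would itself break on U+2028.
    for line_no, line in enumerate(source.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in NON_ASCII_PATTERN_WHITE_SPACE:
                findings.append((line_no, col, NON_ASCII_PATTERN_WHITE_SPACE[ch]))
    return findings

# Hypothetical source: a comment followed by U+2028, then what looks like code.
sample = "// check precondition\u2028x = 1;\ny = 2;\n"
print(find_hidden_white_space(sample))
```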

Impact

The final output of the group is Unicode Technical Standard #55, Source Code Handling. This new specification brings together in one place a description of the problems specific to source code, together with guidance and best practices for programming language and software tooling developers. Many of the APIs necessary for supporting those best practices were already specified and implemented in ICU, Unicode’s software library that is already in all modern operating systems. However, one new useful API has been added to ICU, and will be released in October 2023. This is the new bidiSkeleton function, used to detect identifiers such as Aא1 above.

Coordinated security-related updates have been made to UAX #9, Unicode Bidirectional Algorithm and UAX #31, Unicode Identifiers and Syntax along with updates to UTS #39, Unicode Security Mechanisms.

This work would not have been possible without the set of dedicated and knowledgeable people that made up the SCWG, especially Robin Leroy, the vice chair. Others include Alexei Chimendez, Asmus Freytag, Barry Dorrans, Catherine “whitequark”, Chris Ries, Corentin Jabot, Dante Gagne, Deborah Anderson, Ed Schonberg, Elnar Dakeshov, Jan Lahoda, Julie Allen, Ken Whistler, Liang Hai (梁海), Manish Goregaokar, Mark Davis, Markus Scherer, Michael Fanning, Nathan Lawrence, Ned Holbrook, Peter Constable, Randy Brukardt, Rich Gillam, Richard Smith, Roozbeh Pournader, Steve Dower, and Tom Honermann. For more details on their contributions, see Acknowledgements.

Having completed its main task, the SCWG is formally being retired — but we are keeping the list of participants in case we need to call on their expertise in the future!



Support Unicode
To support Unicode’s mission to ensure everyone can communicate in their languages across all devices, please consider adopting a character, making a gift of stock, or making a donation. As Unicode, Inc. is a US-based open source, open standards, non-profit, 501(c)3 organization, your contribution may be eligible for a tax deduction. Please consult with a tax advisor for details.


Tuesday, January 17, 2023

What’s New in Emoji 15.1?

Doing more, with less

By: Jennifer Daniel, Chair of the Emoji Subcommittee

[image phoenix]

This past Fall, the Unicode Technical Committee announced the delay of Unicode 16.0. This wasn’t without precedent: COVID slowed down the release of Unicode 14.0 in 2020 and the world seemed to survive 😉. Subcommittees were well prepared and adjusted accordingly, discussing what this meant for their respective areas of expertise.

For the Emoji Subcommittee (ESC), the group responsible for defining the rules, algorithms, and properties necessary to achieve interoperability between different platforms for those smiley faces that appear on your keyboard (shout out 😁🥰🥹😵‍💫!), this delay presented an opportunity. Sure, we were so close to exhaling a sigh of relief (the intake period for Emoji 16.0 proposals had just completed). But upon learning we couldn’t ship any new codepoints until 2024, we turned our energy towards recommending new emoji based on existing ones. (These are called emoji ZWJ sequences: a combination of multiple emoji that displays as a single emoji, like 👩 🏽 + 🏭 = 🧑🏽‍🏭.)

When Less is More

An incredibly powerful aspect of written language is that it consists of a finite number of characters that can "do it all". And yet, as the emoji ecosystem has matured over time our keyboards have ballooned and emoji categories are about to hit or have hit a level of saturation. Upon reflecting on how emoji are used, the ESC has entered a new era where the primary way for emoji to move forward is not merely to add more of them to the Unicode Standard. Instead, the ESC approves fewer and fewer emoji proposals every year.

But our work is not done. Not by a long shot. Language is fluid and doesn’t stand still. There is more to do! This “off-cycle” gives us a chance to address some long-standing major pain points using emoji. The first one that came to mind: skin-tone.

What is a family?

The encoding of multi-person multi-tone support has matured over the years; however, the implementation can seem random to the average person. While it’s true that all people emoji have toned options (with the exception of characters where you can’t see skin, like 🤺), there are … misfits. Some two-person emoji offer tone support (🧑🏻‍❤️‍🧑🏿); others do not (👯). A few non-RGI emoji render with tone but with no affordance to change one of the two characters (for example, 🤼🏾‍♂).

And then … There is the suite of family emoji (👨‍👦👨‍👦‍👦👨‍👧👨‍👧‍👦👨‍👧‍👧👩‍👦👩‍👦‍👦👩‍👧👩‍👧‍👦👩‍👧‍👧 👨‍👨‍👦👨‍👨‍👦‍👦👨‍👨‍👧👨‍👨‍👧‍👦👨‍👨‍👧‍👧👩‍👩‍👦👩‍👩‍👦‍👦👩‍👩‍👧👩‍👩‍👧‍👦👩‍👩‍👧‍👧👨‍👩‍👦👨‍👩‍👦‍👦👨‍👩‍👧👨‍👩‍👧‍👦👨‍👩‍👧‍👧👪). These characters include two people, three people, sometimes four, and none of them have any tone support (!). We seem to have a lot of family emoji and yet simultaneously not enough.

The 26 “family” emoji can be broken down into four groups:

[image families]

Despite the Unicode Standard containing 26 “family” emoji, each one of these glyphs is overly prescriptive with regard to delivering on a visual representation of a family. The inclusion of many permutations of families was well intentioned. But we can’t list them all, and by listing some of the combinations, it calls attention to the ones that are excluded.

What even is a family? For some, family is the people you were raised with. Others have embraced friends as their chosen family. Some families have children, other families have pets. There are multi-generational families, multi-racial families and of course many families are any combination of all of these characteristics and more.

Fortunately, we don’t need to add 7000 variants to your keyboards (even this would fall short of capturing the breadth of "family" as a concept). Instead we can juxtapose individual emoji to capture a concept with some reasonable level of specificity, not unlike arranging letters together to create words to convey concepts 😉

[image toned families]

For emoji keyboards to advance in creating more intuitive and personalized experiences, the Emoji Subcommittee is recommending a visual deprecation of the family emoji. This small set of emoji will be redesigned as part of a multi-phase effort to “complete the set” of toned variants for the remaining multi-person emoji. This of course raises the question: when there are as many families as there are people in the world, is there an effective way of conveying the concept of “family” without being overly prescriptive in defining what is and is not a family? Well, thankfully icons can do a lot of heavy lifting without requiring very much detail.
[image before-after]

When is an emoji running for the police or getting chased by them?

Another area the ESC is actively exploring is how the semantics of emoji sequences can differ when writing directionality changes. Some emoji characters have semantics that encode implicit directionality, but when the string is mirrored, their meaning may be unintentionally lost or changed.

[image rightwards]
Left to Right Emoji Sequence
Quickly running towards an “exciting” police chase


[image leftwards]
Right to Left Emoji Sequence
Running away from the coppers


What, if anything, can we do to help ensure that messages are meaningfully translated, be they tiny pictures or tiny letters? As part of 15.1 we’re proposing that a small set of emoji with strong directionality, with an initial focus on people, be able to face the opposite direction. Soon you too can run towards or away from … excitement.

Emoji 15.1

Given that the intake cycle of emoji proposals for Unicode 16.0 ended last July, the Emoji Subcommittee has also decided to temporarily delay the intake of Unicode Version 17.0 proposals until April 2024. Fortunately, you won’t have to wait until then to get new emoji. The list of recommendations includes 578 characters (most of them the candidates described above to support directionality). The list also includes a few humble additions: a broken chain, a lime, a non-poisonous mushroom, a nodding face and a shaking face, and a phoenix bird. Each of these uses a unique valid ZWJ sequence of emoji, so while they look like atomic characters made of a single codepoint, they are composed of two or more codepoints.

[image candidates]

Broken chain is the result of 🔗💥, with a variety of meanings, such as freedom, breaking a cycle, or perhaps a broken url ;-). Like the bi-directional emoji touched on above, nodding face and shaking face are the result of 🙂↕️ and 🙂↔️ respectively. Oh, and of course there is a phoenix rising from the ashes (🐦🔥), a perfect metaphor to capture where we are today.
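
To see how such a sequence is put together, the sketch below joins the bird and fire emoji from the phoenix example with U+200D ZERO WIDTH JOINER and lists the resulting code points. The helper name `zwj_sequence` is an assumption, and whether the result displays as a single phoenix glyph depends on font and platform support.

```python
ZWJ = "\u200D"  # ZERO WIDTH JOINER

def zwj_sequence(*parts):
    """Join emoji with ZERO WIDTH JOINER to form a ZWJ sequence."""
    return ZWJ.join(parts)

bird, fire = "\U0001F426", "\U0001F525"
phoenix = zwj_sequence(bird, fire)

print([f"U+{ord(ch):04X}" for ch in phoenix])
print(len(phoenix))
```

The string contains three code points even where it renders as one image, which is why text-processing code should not assume that one visible emoji equals one codepoint.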

The Unicode Technical Committee (UTC) will review the required documents at its first meeting of 2023 in January – and if these candidates move forward, you can expect an update from the UTC later this Spring and Summer.



Thursday, January 14, 2016

Proposed Update UAX #9, Unicode Bidirectional Algorithm

A new proposed update of UAX #9, Unicode Bidirectional Algorithm for the Unicode 9.0 release is now available for public review and comment.

The table in Section 2.7, Markup and Formatting, has been updated to reflect changes to isolates in HTML5 and CSS.

For further information and instructions on how to leave feedback, please see Public Review Issue #315.

Wednesday, November 25, 2015

New Character Property for Prepended Concatenation Marks

The Unicode Technical Committee is seeking feedback on a proposal to define a new character property for the class of prepended concatenation marks, also referred to as prefixed format control characters or, more generically, as subtending marks. Characters in that class include U+0600 ARABIC NUMBER SIGN and U+06DD ARABIC END OF AYAH. The new property, named Prepended_Concatenation_Mark and targeted for Unicode 9.0, would provide a mechanism to handle subtending marks collectively via properties rather than by hardcoded enumeration. A detailed description of the issue and how to provide feedback are given in Public Review Issue #310.
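
The idea of a single property-style check can be sketched as follows. Python's standard `unicodedata` module does not expose Prepended_Concatenation_Mark, so the table here is a stand-in containing only the two characters named above; the table and function names are hypothetical, and a real implementation would load the full property data.

```python
import unicodedata

# Stand-in for real property data: only the two characters named in the post.
PREPENDED_CONCATENATION_MARK = frozenset({"\u0600", "\u06DD"})

def is_prepended_concatenation_mark(ch):
    """Single predicate that callers use instead of per-character special cases."""
    return ch in PREPENDED_CONCATENATION_MARK

for ch in sorted(PREPENDED_CONCATENATION_MARK):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ({unicodedata.category(ch)})")
```

Routing every caller through one predicate means that new members of the class only require a data update, not a code change.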

Tuesday, August 26, 2014

PRI #279: Proposed Update UAX #9, Unicode Bidirectional Algorithm

The proposed update for Unicode 8.0 addresses three issues where the bidirectional algorithm, as written, did not produce the intended results in certain specific cases. Details and justification for the proposed changes to the specification are in a background document accessible from the PRI page.

The closing date for this issue is October 20, 2014. For information about how to discuss this issue and how to supply formal feedback, please see the feedback and discussion instructions on the PRI page.

The Public Review Issues index page is: http://www.unicode.org/review/

Friday, June 28, 2013

Testing the Unicode Bidirectional Algorithm for Unicode 6.3

Unicode Standard Annex #9, Unicode Bidirectional Algorithm (UBA), has a major update slated for release in September, 2013. This update is the most significant change in Unicode 6.3. The changes to the algorithm and text have already been approved by the Unicode Technical Committee, subject to final editorial review.

The Unicode Technical Committee is encouraging implementations to test their code against the new test files and the two reference implementations during the month of July, 2013. It is vital that the interpretation of the text of the specification in UAX #9 be absolutely clear, and that the values in the test data be thoroughly tested by at least two implementations before release, because any changes after release—even to fix problems—can cause significant interoperability problems. The UBA is used for displaying all Arabic and Hebrew text on the web and in application programs, so there are significant ramifications for any changes to the algorithm.

The proposed update to UAX #9 involves a substantial extension of the UBA to allow for the implementation of isolate runs, introducing new Bidi_Class property values and formatting characters in support of that extension. There are also changes to Section 3.3.5, Resolving Neutral and Isolate Formatting Types to resolve paired punctuation marks as a unit. For details, see http://www.unicode.org/reports/tr9/tr9-28.html.
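
For orientation, the isolate formatting characters that shipped in Unicode 6.3 are U+2066 through U+2069; the post itself does not list them. The sketch below prints their Bidi_Class values using Python's `unicodedata` and shows a hypothetical helper, `isolate`, that wraps a string so its direction does not affect the surrounding text.

```python
import unicodedata

LRI, RLI, FSI, PDI = "\u2066", "\u2067", "\u2068", "\u2069"

def isolate(text, opener=FSI):
    """Wrap text in an isolate initiator and POP DIRECTIONAL ISOLATE."""
    return opener + text + PDI

for ch in (LRI, RLI, FSI, PDI):
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.bidirectional(ch)}")

# Inserting a value of unknown direction into a message (illustrative only).
message = "User " + isolate("\u05D0\u05D1") + " logged in"
print([f"U+{ord(ch):04X}" for ch in message[5:9]])
```

First-strong isolation is a common default for inserted values, because the direction is taken from the wrapped text itself rather than fixed by the caller.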

For further information about the review see http://www.unicode.org/review/pri254/.