All notable changes to the Pomsky regular expression language will be documented in this file.
The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Join our Discord to get help or meet other users and contributors!
If you want to contribute, Pomsky now has a contributor's guide and a code of conduct.
- Added `--json` flag to work better with other tools and IDEs. Someone's already working on an IntelliJ plugin, which will be announced soon!
- Every error and warning now has a diagnostic code such as `P0116`
- Don't allow `::0`, which doesn't work
- In Python, don't allow Unicode properties (`\p{Prop}`), since Python doesn't support these
- In Python, don't allow forward references, since Python doesn't support these
- In Python, emit `\UHHHHHHHH` (where `H` is a hex digit) rather than `\u{...}` for large code points
- In Java, replace dashes (`-`) in Unicode properties with underscores rather than removing them
- In Java and Ruby, emit `\x{...}` rather than `\u{...}` for large code points
- In Ruby, enforce that expressions with named capturing groups can't contain references to unnamed groups
- In Ruby, don't emit `\xHH` (where `HH` are two hex digits) for non-ASCII code points, since Ruby treats them as bytes rather than code points, and bytes above `\x7F` may be invalid in UTF-8
- In Ruby, disallow repeated assertions (boundaries or lookarounds)
- In JS, wrap repeated assertions in an extra non-capturing group
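
Several of the flavor fixes above come down to which escapes a target engine accepts. The following is a rough check using Python's standard `re` module, not Pomsky's own code. The helper names `compiles` and `is_valid_utf8` are made up for this sketch, and the last part only illustrates the UTF-8 fact behind the Ruby `\xHH` change; it does not run Ruby.

```python
import re


def compiles(pattern: str) -> bool:
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# \UHHHHHHHH (eight hex digits) is how Python's re spells a large code point.
u_escape_matches = re.fullmatch(r"\U0001F600", "\U0001F600") is not None

# The brace form is rejected: \u must be followed by exactly four hex digits.
brace_form_compiles = compiles(r"\u{1F600}")

# \p{...} was not accepted by the re module when this entry was written;
# other Python versions may behave differently, so no result is claimed here.
property_compiles = compiles(r"\p{Lu}")

# A lone byte above 0x7F is not valid UTF-8, which is why a byte-level
# \xHH escape is risky for non-ASCII code points.
lone_byte_ok = is_valid_utf8(b"\xe9")                 # single byte 0xE9
encoded_ok = is_valid_utf8("\u00e9".encode("utf-8"))  # U+00E9 as two bytes

print(u_escape_matches, brace_form_compiles, lone_byte_ok, encoded_ok)
# True False False True
```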
- The test harness and fuzzer were improved to also compile the output of all Python, Java, JavaScript and Ruby test cases to detect syntax errors. All flavors except C# are tested in this way now, and the fuzzer already found a few bugs.
- Major refactor of error handling, so that reporting multiple errors at once becomes less awkward. This will enable us to report better diagnostics in the future.
- GitHub CI improvements
- Website improvements: Replaced PNG with SVG logo, new images and more spacing on the home page, redesigned “examples” section, new color palette, breadcrumbs on documentation pages, added Discord icon, bugfixes
0.8.0 - 2022-12-12
Special announcement: You can sponsor me now for my work on Pomsky. If you can spare a few dollars or convince your employer to donate, that would really help me to make maintaining Pomsky more sustainable. If I get enough donations, I can invest more time in the development of Pomsky, as there's still a lot of work to do!
Remember that you can also help out by filing issues or contributing 😉
- Added inline regex expressions: Include text that is not transformed or validated. For example:

      regex '[\w[^a-f]]'

  This allows using regex features not yet supported by Pomsky, like nested character classes. Note, however, that Pomsky does not validate inline regexes, so there's no guarantee that the output is correct.
- Added the dot (`.`). It matches anything except line breaks by default, or anything including line breaks in multiline mode. More information
- Added an optimization pass, which removes redundant groups, simplifies repetitions and deduplicates the contents of character classes.
  Optimizations are useful when making heavy use of variables to write readable code and still get the most efficient output. More optimizations are planned, stay tuned!
- Group names now must be no longer than 32 characters. For example, `:this_is_a_very_very_very_long_name()` is no longer allowed. The reason is that group names this long are unsupported by PCRE, and we're enforcing the same limit everywhere to make pomsky more consistent across regex flavors.
- The CLI help interface was overhauled. It is now more informative and beautiful. To get help, type `pomsky -h` for short help, or `pomsky --help` for longer descriptions and additional details.
- It is now possible to specify allowed features in the CLI. This was previously only possible in the Rust library. Use `pomsky --help` for more information.
- Fix Unicode script codegen for JavaScript: Pomsky now emits the correct syntax for Unicode scripts in JS.
- Escape `[`, `&` and `|` within character classes. This is required in regex flavors that support nested character classes.
- Fix `\e` being emitted, even though it is not supported in the Rust flavor
- Fix broken feature gates: A few feature gates were defunct and have been fixed.
- Fix position of error report labels with Unicode chars: This was a long-standing bug in miette that was fixed recently.
- Don't silently ignore exclamation points at the end of a character class.
- Only allow Unicode properties such as `Lowercase` or `Emoji` in regex flavors that support them.
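
To make the character-class escaping fix above concrete, here is a minimal sketch of the idea in Python. The helpers `escape_in_class` and `char_class` are hypothetical names for illustration and are not Pomsky's implementation; the set of escaped characters is limited to the three named in the entry, and other class metacharacters such as `]`, `\`, `^` and `-` are assumed to be handled elsewhere.

```python
import re

# Characters named in the changelog entry. In flavors with nested classes or
# set operations, these can be read as syntax when left unescaped in a class.
ESCAPED_IN_CLASS = {"[", "&", "|"}


def escape_in_class(ch: str) -> str:
    """Backslash-escape a character if it is in the set above (hypothetical helper)."""
    return "\\" + ch if ch in ESCAPED_IN_CLASS else ch


def char_class(chars: str) -> str:
    """Build a character class that matches exactly the given literal characters."""
    return "[" + "".join(escape_in_class(c) for c in chars) + "]"


pattern = char_class("[&|")
print(pattern)  # [\[\&\|]

# Python's re is used here only to confirm the escaped class is well-formed.
matches_all = all(re.fullmatch(pattern, c) for c in "[&|")
matches_other = re.fullmatch(pattern, "a") is not None
print(matches_all, matches_other)  # True False
```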
- Audit dependencies using `cargo-audit` in continuous integration. This means that we'll be made aware of any vulnerability in our dependencies reported to the RustSec database.
- Make release binaries auditable: The binaries published on GitHub are now built with `cargo-auditable`. This means that `cargo audit bin /path/to/pomsky` can now scan all included dependencies.
- Remove thiserror dependency from the `pomsky` and `pomsky-syntax` crates, improving compile time.
- Testing improvements:
  - Compile all PCRE and Rust regular expressions produced by integration tests to make sure the output is well-formed. This caught some of the bugs mentioned above! We're currently looking into ways to do the same with the other flavors.
  - Measure test coverage in CI and publish it to coveralls.io. The results are here (also accessible by clicking on the badge in the README). Note that the measurement is imperfect, so the results may not be accurate.
  - Add end-to-end tests for the CLI and improve test coverage
0.7.0 - 2022-09-10
- `atomic ()` groups, supported in all flavors except Python, Rust and JavaScript. Atomic groups discard backtracking information to optimize match performance (more information).
- The pomsky library is now published as a WASM module to npm! You can install it with

      $ npm install pomsky-wasm # yarn add pomsky-wasm

  How to use it is described here.
- The parser was rewritten and is now much faster with fewer dependencies. In my benchmarks, it is 3 to 5 times faster than the previous parser.
- The parser was moved to the `pomsky-syntax` crate. You can now directly use it in Rust programs, without pulling in the whole compiler.
- The limit for the number of repetitions after an expression has been removed, although the limitation was almost impossible to run into in real code.
- Release binaries are now stripped by default, to reduce the binary size.
- The clap argument parser was replaced with the much smaller lexopt. This further reduces the binary size.
- The `<%`, `%>`, `[cp]` and `[codepoint]` syntax has been removed. Previously it was deprecated and issued a warning.
- When compiling the library crate with `miette` support, the `fancy` feature is now enabled by default to fix a compilation error.
- A repeated boundary or anchor is now correctly wrapped in parentheses.
0.6.0 - 2022-08-03
- `^` and `$` as aliases for `Start` and `End`
- Leading pipes. This allows you to format expressions more beautifully:

      | 'Lorem'
      | :group(
          | 'ipsum'
          | 'dolor'
          | 'sit'
          | 'amet'
        )
      | 'consetetur'

- Improved diagnostics for typos. When you spell a variable, capturing group or character class wrong, pomsky will suggest the correct spelling:

      $ pomsky '[Alpabetic]'
      error:
        × Unknown character class `Alpabetic`
         ╭────
       1 │ [Alpabetic]
         ·  ────┬────
         ·      ╰── error occurred here
         ╰────
        help: Perhaps you meant `Alphabetic`

- Many regex syntax diagnostics were added. Pomsky now recognizes most regex syntax and suggests the equivalent pomsky syntax. For example:

      $ pomsky '(?<grp> "test")'
      error:
        × This syntax is not supported
         ╭────
       1 │ (?<grp> "test")
         · ───┬───
         ·    ╰── error occurred here
         ╰────
        help: Named capturing groups use the `:name(...)` syntax. Try `:grp(...)` instead

- A plus directly after a repetition (e.g. `'a'{2}+`) is now forbidden. Fix it by adding parentheses: `('a'{2})+`. The reason is that this syntax is used by regular expressions for possessive quantifiers. Forbidding this syntax in pomsky allows for better diagnostics.
- Deprecated `[.]`, `[codepoint]` and `[cp]`. They should have been deprecated before, but the warnings were missed in the previous release.
- Pomsky now sometimes reports multiple errors at once. The number of errors is limited to 8 in the CLI.
0.5.0 - 2022-07-04
This is the first release since Rulex was renamed to Pomsky.
If you are using the `rulex` crate, replace it with `pomsky`. The `rulex-macro` crate should be replaced with `pomsky-macro`. To install the new binary, see instructions. If you installed rulex with cargo, you can remove it with

    rm $(type -P rulex)

- Deprecation warnings for `<%` and `%>`. These were deprecated before, but Pomsky wasn't able to show warnings until now.
- Improved codegen for Unicode chars between 128 and 255
- Some diagnostics involving built-in variables were improved
- The words `atomic`, `if`, `else` and `recursion` are now reserved
- `Grapheme` is now only allowed in the PCRE, Java and Ruby flavors. Previously, it was accepted by Pomsky for some flavors that don't support `\X`.
- Keywords and reserved words are no longer accepted as variable names
- The `Rulex` struct was renamed to `Expr`, and `RulexFeatures` was renamed to `PomskyFeatures`
- `Span::range()` now returns an `Option<Range<usize>>` instead of a `Range<usize>`
- `Expr::parse` and `Expr::parse_and_compile` now return a `(String, Vec<Warning>)` tuple
0.4.3 - 2022-06-19
- Add libFuzzer and AFL fuzzing boilerplate to find panics
- Add artificial recursion limit during parsing to prevent stack exhaustion. This means that groups can be nested by at most 127 levels. I don't think you'll ever run into this limitation, but if you do, you can refactor your expression into variables.
- Fixed crash caused by slicing into a multi-byte UTF-8 code point after a backslash or in a string
- Fixed crash caused by stack exhaustion when parsing a very deeply nested expression
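
The recursion limit described in this release can be pictured with a toy recursive-descent parser that carries a depth counter. This is only a sketch of the general technique, written in Python rather than Rust; `parse_group`, `check` and `NestingTooDeep` are hypothetical names, the grammar is reduced to bare parentheses, and how Pomsky actually counts levels is an assumption here.

```python
MAX_NESTING = 127  # limit stated in the changelog entry


class NestingTooDeep(Exception):
    """Raised when groups are nested deeper than MAX_NESTING."""


def parse_group(src: str, pos: int, depth: int) -> int:
    """Consume items until ')' or end of input and return the stop position."""
    if depth > MAX_NESTING:
        raise NestingTooDeep(f"groups nested more than {MAX_NESTING} levels")
    while pos < len(src):
        ch = src[pos]
        if ch == "(":
            # Recurse for the nested group, one level deeper.
            pos = parse_group(src, pos + 1, depth + 1)
            if pos >= len(src) or src[pos] != ")":
                raise ValueError("unclosed group")
            pos += 1  # skip the closing parenthesis
        elif ch == ")":
            return pos  # the caller consumes the closing parenthesis
        else:
            pos += 1
    return pos


def check(src: str) -> None:
    """Parse the whole input, raising on bad or too deep nesting."""
    end = parse_group(src, 0, 0)
    if end != len(src):
        raise ValueError("unmatched ')'")


check("(" * 127 + ")" * 127)  # exactly at the limit: accepted

try:
    check("(" * 100_000 + ")" * 100_000)
    outcome = "accepted"
except NestingTooDeep:
    # The parser stops at depth 128, long before the call stack is exhausted.
    outcome = "rejected"
print(outcome)  # rejected
```

Because the check runs on entry to each nested call, a pathological input fails with an ordinary error instead of overflowing the stack.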
0.4.2 - 2022-06-16
- Built-in variables were added:
  - `Start` as an alias for `<%`, which matches the start of the string
  - `End` as an alias for `%>`, which matches the end of the string
  - `Codepoint` and `C` as aliases for `[codepoint]`, matching a single code point
  - `G` as an alias for `Grapheme`, matching an extended grapheme cluster
- `Grapheme` was turned from a keyword into a built-in variable.
- The repository now has issue templates and a pull request template.

`<%`, `%>`, `[codepoint]`, `[cp]` and `[.]` will be deprecated in the future. It is recommended to use `Start`, `End` and `Codepoint`/`C` instead.
There won't be a replacement for `[.]`, but you can use `![n]` to match any code point except the ASCII line break.
- #29: Fix a miscompilation of a repeated empty group, e.g. `()?`. Thanks, sebastiantoh!
- Make the parser more permissive to parse arbitrary negated expressions. This results in better error messages.
- Add missing help messages to diagnostics and fix a few that were broken:
  - When parsing `^`: Use `Start` to match the start of the string
  - When parsing `$`: Use `End` to match the end of the string
  - When parsing e.g. `(?<grp>)`: Named capturing groups use the `:name(...)` syntax. Try `:grp(...)` instead
  - When parsing e.g. `\4`: Replace `\\4` with `::4`
  - When parsing e.g. `(?<=test)`: Lookbehind uses the `<<` syntax. For example, `<< 'bob'` matches if the position is preceded with bob.
  - When parsing e.g. `(?<!test)`: Negative lookbehind uses the `!<<` syntax. For example, `!<< 'bob'` matches if the position is not preceded with bob.
- Improve test suite: Help messages are now tested as well, and failing tests can be "blessed" when the output has changed. Test coverage was also improved.
- The entire public API is now documented.
0.4.1 - 2022-06-03
- Fixed a miscompilation in situations where a variable followed by a `?` expands to a repetition
0.4.0 - 2022-06-03
The repository was moved to its own organization! 🎉 It also has a new website with an online playground!
- API to selectively disable some language features
- Online playground to try out Pomsky. You can write pomsky expressions on the left and immediately see the output on the right.
- Ranges now have a maximum number of digits. The default is 6, but can be configured.
  This prevents DoS attacks when compiling untrusted input, since compiling ranges has exponential runtime with regard to the number of digits.
- `ParseOptions` was moved out of `CompileOptions`. This means that the `parse_and_compile` method now expects three parameters instead of two.
0.3.0 - 2022-03-29
- A book, with instructions, a language tour and a formal grammar!
- Variables! For example, `let x = 'test';` declares a variable `x` that can be used below. Read this chapter from the book to find out more.
- Number range expressions! For example, `range '0'-'255'` generates this regex: `0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?`
- Relative references: `::-1` refers to the previous capturing group, `::+1` to the next one
- `w`, `d`, `s`, `h`, `v` and `X` now have aliases: `word`, `digit`, `space`, `horiz_space`, `vert_space` and `Grapheme`.
- `enable lazy;` and `disable lazy;` to enable or disable lazy matching by default at the global scope or in a group.
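
The regex quoted above for `range '0'-'255'` can be checked by brute force. The snippet below is a small sanity check using Python's `re` module (an arbitrary choice of engine for this sketch); it copies the regex verbatim from the entry and tests every integer from 0 to 999 in decimal notation.

```python
import re

# Regex copied verbatim from the changelog entry for range '0'-'255'.
RANGE_0_255 = re.compile(
    r"0|1[0-9]{0,2}|2(?:[0-4][0-9]?|5[0-5]?|[6-9])?|[3-9][0-9]?"
)

# fullmatch requires the whole string to match, so "256" cannot pass as "25".
accepted = [n for n in range(1000) if RANGE_0_255.fullmatch(str(n))]

print(accepted == list(range(256)))  # True

# Leading zeros are not part of the generated range.
print(RANGE_0_255.fullmatch("007") is None)  # True
```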
- Made `greedy` the default for repetitions. You can opt into lazy matching with the `lazy` keyword or globally with `enable lazy;`.
- POSIX classes (e.g. `alnum`) have been renamed to start with `ascii_`, since they only support Basic Latin
- Double quoted strings can now contain escaped quotes, e.g. `"\"test\""`. Backslashes now must be escaped. Single quoted strings were not changed.
- Improved Unicode support
  - In addition to Unicode general categories and scripts, pomsky now supports blocks and other boolean properties
  - Pomsky now validates properties and tells you when a property isn't supported by the target regex flavor
  - Shorthands (`[h]` and `[v]`) are substituted with character classes when required to support Unicode everywhere
- Named references compile to numeric references (like relative references), which are better supported
- A `?` after a repetition is now forbidden, because it is easy to confuse with a lazy quantifier. The error can be silenced by wrapping the inner expression in parentheses, e.g. `([w]{3})?`.
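
For readers unfamiliar with the greedy and lazy distinction behind the new default above, here is a short illustration at the regex level. It uses Python's `re` module with hand-written patterns; it does not show Pomsky output.

```python
import re

text = "<a><b>"

# Greedy: the repetition takes as much as it can, then backtracks if needed.
greedy = re.match(r"<.+>", text).group()

# Lazy: the repetition takes as little as it can, extending only when forced.
lazy = re.match(r"<.+?>", text).group()

print(greedy)  # <a><b>
print(lazy)    # <a>
```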
- `R` was removed, because it didn't work properly, and I'm still unsure about the best syntax and behavior.
- A `?` following a repetition no longer miscompiles: `([w]{3})?` now correctly emits `(?:\w{3})?` instead of `\w{3}?`.
- A `{0,42}` repetition no longer miscompiles (it previously emitted `{,42}`).
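
The first fix above matters because the two outputs mean different things: `(?:\w{3})?` makes the whole three-character group optional, while `\w{3}?` is a lazy quantifier that still requires exactly three characters. A quick check with Python's `re` module, used here only as a convenient engine:

```python
import re

optional_group = re.compile(r"(?:\w{3})?")  # correct output: zero or one group
lazy_repeat = re.compile(r"\w{3}?")         # miscompiled output: lazy {3}

# Only the optional group accepts the empty string.
empty_optional = optional_group.fullmatch("") is not None
empty_lazy = lazy_repeat.fullmatch("") is not None

# Both accept exactly three word characters.
abc_optional = optional_group.fullmatch("abc") is not None
abc_lazy = lazy_repeat.fullmatch("abc") is not None

print(empty_optional, empty_lazy)  # True False
print(abc_optional, abc_lazy)      # True True
```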
0.2.0 - 2022-03-12
- Improved the Rust macro; pomsky expressions are written directly in the Rust source code, not in a string literal:

      let regex: &str = rulex!("hello" | "world" '!'+);

- There are a few limitations in the Rust macro due to the way Rust tokenizes code:
  - Strings with more than 1 code point must be enclosed in double quotes, single quotes don't work
  - Strings can't contain backslashes; this will be fixed in a future release
  - Code points must be written without the `+`, e.g. `U10FFFF` instead of `U+10FFFF`
  - Pomsky expressions can contain Rust comments; they can't contain comments starting with `#`
0.1.0 - 2022-03-11
Initial release