regex - Why doesn't Perl v5.22 find all the sentence boundaries?
This is fixed in Perl 5.22.1. I write about it in "Perl v5.22 adds fancy Unicode word boundaries".
Perl v5.22 added the Unicode assertions from TR #29. I've been playing with the sentence boundary assertion, but it only seems to find the start and end of the text:
    use v5.22;
    $_ = "See Spot. (Spot is a dog.) See Spot run. Run Spot, run!\x{2029}New paragraph.";
    while( m/\b{sb}/g ) {
        say "Sentence boundary at ", pos;
    }
The output notes the sentence boundaries at the start and end of the text, but not after the full stops, sentence terminators, or parens:
    Sentence boundary at 0
    Sentence boundary at 70
The Unicode Breaks Tester shows them where I expect them based on TR #29.
I couldn't find non-trivial tests in the perl source for this feature. I'm digesting the technical report to create the appropriate test cases, but so far this looks like an untested and broken feature.
Calle Dybedahl's comment gets it right (and when that is turned into an answer, I'll accept it). This is a broken feature in v5.22.0 and, as far as I can tell, untested. I had an issue compiling stuff with the latest perls last night and ended the day with this question.
The perl5.22.1 perldelta does not mention the particular changes (and "mention" might be too strong, since it merely alludes to possible things that were wrong without enumerating them). It mentions an incompatible change from 5.20.0 (a cut and paste error?) and a "single" exception, but more than one issue. The reference to "sane" made me think of changes related to the panic issue in the next subsection. The mention of "several bugs" with only one rt.perl.org reference made me think the bugs were related to the panic issue.
=head1 Incompatible Changes
There are no changes intentionally incompatible with 5.20.0 other than the following single exception, which we deemed to be a sensible change to make in order to get the new C<\b{wb}> and (in particular) C<\b{sb}> features sane before people decided they're worthless because of bugs in their Perl 5.22.0 implementation and avoided them in the future. If any others exist, they are bugs, and we request that you submit a report. See L</Reporting Bugs> below.
=head2 Bounds Checking Constructs
Several bugs, including a segmentation fault, have been fixed with the bounds checking constructs (introduced in Perl 5.22) C<\b{gcb}>, C<\b{sb}>, C<\b{wb}>, C<\B{gcb}>, C<\B{sb}>, and C<\B{wb}>. All the C<\B{}> ones now match an empty string; none of the C<\b{}> ones do. L<[perl #126319]|https://rt.perl.org/Ticket/Display.html?id=126319>
Additionally, perlrebackslash, where the new boundaries are documented, doesn't mention that they don't work in v5.22.0.
I disregarded this as a possible fix because of the incongruities in the perldelta and prior experience I've had with new features that aren't adequately (or at all) tested in the perl source. I prematurely cut off that line of investigation and could have saved myself a couple of hours. It's my fault for not getting my code running on the latest binaries, but I had become fixated on the idea that I was doing it wrong and that my code was the problem. Despite numerous past experiences to the contrary, I wasn't entertaining thoughts (other than to update the UCD) that perl was wrong.
Now that I'm at a different machine and have a working perl-5.22.1, I see that my program works as expected in the point release. The perldelta could have been better here.
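For anyone who might still run the test program on a v5.22.0 binary, here is a minimal sketch of guarding against the broken release. The helper name sentence_boundaries and the wording of the warning are my own assumptions, not anything from the perl documentation; the version check relies only on the built-in $] variable.

```perl
use v5.22;
use warnings;

# Hypothetical helper (not part of perl): collect every offset at which
# the sentence boundary assertion \b{sb} matches in the given text.
sub sentence_boundaries {
    my ($text) = @_;
    my @offsets;
    while ( $text =~ m/\b{sb}/g ) {
        push @offsets, pos($text);
    }
    return @offsets;
}

# $] holds the running perl's version as a number; 5.022001 is v5.22.1.
if ( $] < 5.022001 ) {
    warn "perl " . $] . " predates v5.22.1; \\b{sb} results may be incomplete\n";
}

my $text = "See Spot. (Spot is a dog.) See Spot run. "
         . "Run Spot, run!\x{2029}New paragraph.";

say "Sentence boundary at $_" for sentence_boundaries($text);
```

On v5.22.0 this should warn and then report only the start and end of the text, as in the output above; on v5.22.1 or later it should report the interior boundaries as well.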