Store captures and non-captures in source-string order
QAIX > Perl web-programming > Store captures and non-captures in source-string order 13 October 2008 21:46:50


Store captures and non-captures in source-string order

Moritz Lenz 13 October 2008 21:46:50
 When we write regexes, we generally capture stuff in a way that makes
the following semantic analysis easier. For example we could have a
regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
trees of what <this> and <that> matches, not their respective order.

But if you want to reuse the match tree for something different (say,
instead of doing a semantic analysis we want to do syntax highlighting)
it's rather hard to reconstruct the original text, and what part of it
was matched by which subrule. Currently you have to fiddle with $/.from
and $/.to, and sort all subrules by their respective $/.from and $/.to,
and then work out which part hasn't been matched by subrules.
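The bookkeeping described above can be sketched roughly as follows. This is a minimal illustration in Python rather than Perl 6; the function name `chunks` and the `(label, start, end)` span tuples are hypothetical stand-ins for the subrule captures and their $/.from and $/.to positions, and the sketch assumes the captures do not overlap and that the match covers the whole string.

```python
def chunks(source, spans):
    """Return (label, text) pairs covering `source` in source-string order.

    `spans` holds (label, start, end) tuples for non-overlapping captures.
    Text that no capture claims is emitted with the label None.
    """
    result = []
    pos = 0
    # Sort the captures by where they start, then walk them left to right.
    for label, start, end in sorted(spans, key=lambda s: (s[1], s[2])):
        if start < pos:
            raise ValueError("overlapping spans are not handled in this sketch")
        if start > pos:
            # Gap between the previous capture and this one.
            result.append((None, source[pos:start]))
        result.append((label, source[start:end]))
        pos = end
    if pos < len(source):
        # Trailing text after the last capture.
        result.append((None, source[pos:]))
    return result


source = "abc 234 def 789 for 456"
# Hypothetical capture positions: two <ident> matches and one numbered capture.
spans = [("0", 20, 23), ("ident", 0, 3), ("ident", 8, 11)]
for label, text in chunks(source, spans):
    print(label, repr(text))
```

Note that this only separates captured from uncaptured text; the uncaptured stretches come back as single strings and are not broken down any further.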

This is a rather awkward and error-prone procedure, and I wonder if we
should provide some easier way to access all chunks of text in the order
that they were matched.

I guess this description isn't very clear, so I'll try with an example:

"abc 234 def 789 for 456" ~~ mm/ [ <ident> \d+ ]**0..2 'for' (\d+) /;
$/.chunks would be this list:

$<ident>[0],
' ',
'234',
' ',
$<ident>[1],
' ',
'789',
' ',
'for',
' ',
'456'

I don't know if the syntax and exact semantics are very good, but IMHO
we should have some way of reconstructing a match that is closer to the
original string than to the structure of the matching regex.

(I also don't know if that's feasible in terms of efficiency)

Any ideas?

Moritz

--
Moritz Lenz
http://perlgeek.de/ | http://perl-6.de/ | http://sudokugarden.de/
Patrick R. Michaud 12 October 2008 19:08:50
On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
> When we write regexes, we generally capture stuff in a way that makes
> the following semantic analysis easier. For example we could have a
> regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
> trees of what <this> and <that> matches, not their respective order.
> [...]
> But if you want to reuse the match tree for something different (say,
> instead of doing a semantic analysis we want to do syntax highlighting)
> it's rather hard to reconstruct the original text, and what part of it
> was matched by which subrule.

Perhaps aliases...?

m/ <this>+ <that>? <andthen=this>* /

This is probably not exactly what you're looking for, but
that would be what I would look at for this specific example.

Pm
Moritz Lenz 12 October 2008 19:34:49
Patrick R. Michaud wrote:
> On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
>> When we write regexes, we generally capture stuff in a way that makes
>> the following semantic analysis easier. For example we could have a
>> regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
>> trees of what <this> and <that> matches, not their respective order.
>> [...]
>> But if you want to reuse the match tree for something different (say,
>> instead of doing a semantic analysis we want to do syntax highlighting)
>> it's rather hard to reconstruct the original text, and what part of it
>> was matched by which subrule.
>
> Perhaps aliases...?
>
> m/ <this>+ <that>? <andthen=this>* /
>
> This is probably not exactly what you're looking for, but
> that would be what I would look at for this specific example.

I'm looking more for a general solution for which you don't have to
manipulate the rule itself, and which should ideally work with as little
knowledge of the rule as possible.

Just look at the hoops STD5_dump_match (in the same dir as STD.pm)
has to jump through to get hold of the parse tree in the right order.

Moritz

--
Moritz Lenz
http://perlgeek.de/ | http://perl-6.de/ | http://sudokugarden.de/
Larry Wall 13 October 2008 20:47:30
 On Sun, Oct 12, 2008 at 05:34:49PM +0200, Moritz Lenz wrote:
: Patrick R. Michaud wrote:
: > On Sun, Oct 12, 2008 at 11:44:05AM +0200, Moritz Lenz wrote:
: >> When we write regexes, we generally capture stuff in a way that makes
: >> the following semantic analysis easier. For example we could have a
: >> regex m/ <this>+ <that>? <this>*/ if we're only interested in the match
: >> trees of what <this> and <that> matches, not their respective order.
: >> [...]
: >> But if you want to reuse the match tree for something different (say,
: >> instead of doing a semantic analysis we want to do syntax highlighting)
: >> it's rather hard to reconstruct the original text, and what part of it
: >> was matched by which subrule.
: >
: > Perhaps aliases...?
: >
: > m/ <this>+ <that>? <andthen=this>* /
: >
: > This is probably not exactly what you're looking for, but
: > that would be what I would look at for this specific example.
:
: I'm looking more for a general solution for which you don't have to
: manipulate the rule itself, and which should ideally work with as little
: knowledge of the rule as possible.
:
: Just look at the hoops STD5_dump_match (in the same dir as STD.pm)
: has to jump through to get hold of the parse tree in the right order.
:
: Moritz

Yes, funny thing is I was just thinking about the same thing this
morning after Mitchell Charity noticed that elsifs were missing
from the tree. It will be relatively trivial to do this with STD,
since it already produces a general mapping from position to hash,
which it uses to cache whitespace matches and line numbers, but could
easily record what matched where. (See the .<_> hash for that.)
In my case, I was wanting to find the set of non-whitespace things
that are parsed but don't end up in the parse tree. Maybe the :keepall
modifier needs access to something like this as well.

It may also let me remove the kludge whereby ~ remembers the delimiters
on either side.

It could also revolutionize the implementation of split. :)

My big question is how best to make this ordered info available within
a Match, given that we currently use the Positional role for something
else. An argument could be made that this info is more important than
revealing $0,$1 etc at the top level of the Match, that is, that split
semantics are more natural than comb semantics for @($/). One data
point is that the STD grammar uses very little $0 and then only as
a named parameter that happens to have a numeric name. So we could
easily demote $0 etc to meaning $/.numbered[0] or some such. Of course,
it goes the other way too, and we can reveal the splits via a .split
method or some such. Plus we can have multiple levels of splitting
semantics, so then *they'd* be fighting over Positional if we made
one of them default.

So I'm thinking @($/) stays the way it is, but .splits might return
the top-level splits for a given rule, where strings are intermixed
with child tree nodes, whereas something like .allsplits might return
all the ordered strings along with mappings to what parsed them.

If we did that, then there's the question of whether .splits needs to
run the pattern lazily so that we can do a limited /':'/.splits(4)
and such. That may turn out to be abuse of the lazy system though.
And technically, that regex *isn't* binding the colons to a child
node, so there's a little semantic mismatch there as well, since a
split implemented in terms of .splits would look more like /.*?(':')/.
So maybe .splits is the wrong name. Suggestions welcome.
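To make the lazy, limited split idea concrete, here is a rough sketch in Python rather than Perl 6. The generator name `splits` and the `("text", ...)` / `("delim", ...)` tags are hypothetical and only illustrate the idea; it leans on Python's `re.finditer`, which searches for the next delimiter only when the consumer asks for it, so matching stops as soon as the limit is reached.

```python
import re


def splits(pattern, text, limit=None):
    """Lazily yield ("text", s) and ("delim", s) pairs in source order.

    At most `limit` text pieces are produced; the last piece is the
    unsplit remainder of the string.
    """
    pos = 0
    pieces = 0
    for m in re.finditer(pattern, text):
        if limit is not None and pieces + 1 >= limit:
            break  # the next piece must be the remainder, so stop matching
        if m.end() == m.start():
            continue  # empty delimiter matches are skipped in this sketch
        yield ("text", text[pos:m.start()])
        yield ("delim", m.group())
        pos = m.end()
        pieces += 1
    yield ("text", text[pos:])


for kind, piece in splits(":", "a:b:c:d:e", limit=4):
    print(kind, repr(piece))
```

Keeping the delimiters in the stream means the original string can always be rebuilt by joining the pieces, which is the property the ordered-chunk proposals in this thread are after.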

The cool thing about .allsplits is that if you are doing, say, syntax
highlighting on the fly in an editor, it might be relatively easy to
run down the list and determine top-level nodes that limit how much
needs to be reparsed. Contrariwise, with the "fate" system of STD it
might even be relatively easy to put the parser back into a state
that was deeply recursive and restart the parse at any point.

'Course, "relatively easy" is one o' them relative concepts... :)

Larry
Larry Wall 13 October 2008 20:54:36
 Or maybe we're not thinking big enough here. Maybe we're looking at
a generalized tree query language that, as limiting cases, defines the
.splits and .allsplits as (re)linearized query results, where .splits
linearizes the top level nodes, and .allsplits linearizes the leaves,
but many intermediate linearizations are possible. Don't want to
get stuck into binary thinking here...

Larry
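One way to picture .splits and .allsplits as two ends of the same linearization is the following sketch, written in Python rather than Perl 6. The `Node` class, the `linearize` function and its `depth` parameter are all hypothetical; the sketch assumes each node records its start and end offsets and that sibling nodes do not overlap. Text not claimed by a child is attributed to the enclosing node, as a guess at what "mappings to what parsed them" might look like.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    name: str
    start: int
    end: int
    children: List["Node"] = field(default_factory=list)


def linearize(source, node, depth=None):
    """Flatten `node` into (name, text) pairs in source order.

    depth=1 keeps the direct children whole (roughly the .splits idea),
    depth=None descends to the leaves (roughly the .allsplits idea),
    and other depths give the intermediate linearizations.
    """
    if depth == 0 or not node.children:
        return [(node.name, source[node.start:node.end])]
    next_depth = None if depth is None else depth - 1
    out = []
    pos = node.start
    for child in sorted(node.children, key=lambda c: c.start):
        if child.start > pos:
            # Text matched directly by this node, outside any child.
            out.append((node.name, source[pos:child.start]))
        out.extend(linearize(source, child, next_depth))
        pos = child.end
    if pos < node.end:
        out.append((node.name, source[pos:node.end]))
    return out


source = "if x { y }"
tree = Node("statement", 0, 10, [
    Node("cond", 3, 4),
    Node("block", 5, 10, [Node("expr", 7, 8)]),
])
print(linearize(source, tree, depth=1))
print(linearize(source, tree))
```

In both cases the pieces concatenate back to the original text, so the choice of depth changes only how finely the string is attributed, not what is covered.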
Aristotle Pagaltzis 13 October 2008 21:46:50
* Larry Wall <larry@wall.org> [2008-10-13 19:00]:
> Maybe we're looking at a generalized tree query language

That’s an intriguing observation. Another case for having some
XPath-ish facility in the language?

Regards,
--
Aristotle Pagaltzis // <http://plasmasturm.org/>