Playing Regex Golf with Genetic Programming A. Bartoli, A. De Lorenzo, E. Medvet, F. Tarlao University of Trieste, Italy
The problem (“Regex Golf”)
● Specific kind of code golf
● Writing the shortest regular expression which:
  ○ matches all the strings in a given list
  ○ does not match any string in another list
Regex Golf - Naive solution
Positive examples
● The Phantom Menace
● Attack of the Clones
● Revenge of the Sith
● A New Hope
● The Empire Strikes Back
● Return of the Jedi
Negative examples
● The Wrath of Khan
● The Search for Spock
● The Voyage Home
● The Final Frontier
● The Undiscovered Country
● Generations
● First Contact
● Insurrection
● Nemesis
The Phantom Menace|Attack of the Clones|Revenge of the Sith|A New Hope|The Empire Strikes Back|Return of the Jedi
Regex Golf - Best solution
Positive examples
● The Phantom Menace
● Attack of the Clones
● Revenge of the Sith
● A New Hope
● The Empire Strikes Back
● Return of the Jedi
Negative examples
● The Wrath of Khan
● The Search for Spock
● The Voyage Home
● The Final Frontier
● The Undiscovered Country
● Generations
● First Contact
● Insurrection
● Nemesis
m | [tN]|B
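Both regexes shown on these two slides can be checked mechanically against the example lists. The following is a minimal sketch using Python's `re` module; the helper name `solves` and the use of unanchored `search` are assumptions made for illustration, not part of the original tool.

```python
import re

POSITIVES = [
    "The Phantom Menace",
    "Attack of the Clones",
    "Revenge of the Sith",
    "A New Hope",
    "The Empire Strikes Back",
    "Return of the Jedi",
]
NEGATIVES = [
    "The Wrath of Khan",
    "The Search for Spock",
    "The Voyage Home",
    "The Final Frontier",
    "The Undiscovered Country",
    "Generations",
    "First Contact",
    "Insurrection",
    "Nemesis",
]


def solves(pattern, positives, negatives):
    """Hypothetical helper: True if the pattern is found in every
    positive string and in no negative string (unanchored search)."""
    rx = re.compile(pattern)
    return (all(rx.search(s) for s in positives)
            and not any(rx.search(s) for s in negatives))


# Naive solution: alternation of all positives (the titles contain
# no regex metacharacters, so no escaping is needed here).
naive = "|".join(POSITIVES)
# Short solution from the slide.
best = "m | [tN]|B"

for name, pattern in [("naive", naive), ("best", best)]:
    print(name, len(pattern), solves(pattern, POSITIVES, NEGATIVES))
```

Both patterns classify all fifteen titles correctly; the difference lies only in their length, which is what the game penalizes.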
16 difficult instances
A lot of vibrant discussions...
...really a lot...
February 26th, 2014
Automatic Regex Golf
Our previous GP-based tool Automatic regex generation from examples For data extraction
IEEE Computer, GECCO Hot Off the Press http://regex.inginf.units.it
Key observations
● We need a classifier, not an extractor
  ○ No need to identify boundaries
● No need to infer a general pattern
● No need to process streams with unknown items
  ○ We need to “overfit”
No greedy/lazy quantifiers (execution time too long to be practical)
● Leaf nodes:
  ○ a-z, A-Z, ^, $, ..
  ○ problem-dependent elements (next slide)
Problem-dependent elements
● All characters in matches
● All partial ranges including those characters
● All “most useful” n-grams (n=2,3,4)
  ○ Build all n-grams
  ○ Score each n-gram based on its frequency: +1 for each match, -1 for each unmatch
  ○ Rank n-grams
  ○ Select the smallest set totalling M points (M being the number of matches)
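The n-gram selection steps above can be sketched as follows. This is one possible reading of the slide, not the authors' implementation: it assumes that n-grams are built from the matches only, that an n-gram scores +1 per match containing it and -1 per unmatch containing it, and that n-grams with non-positive score are never selected. The function name `useful_ngrams` and the tie-breaking order are hypothetical.

```python
def useful_ngrams(matches, unmatches, sizes=(2, 3, 4)):
    """Sketch: pick the top-ranked n-grams until their scores total
    at least M points, where M is the number of matches."""
    # Build all n-grams occurring in the matches (assumption).
    candidates = {
        s[i:i + n]
        for s in matches
        for n in sizes
        for i in range(len(s) - n + 1)
    }
    # Score: +1 for each match containing the n-gram,
    # -1 for each unmatch containing it.
    scores = {
        g: sum(g in s for s in matches) - sum(g in s for s in unmatches)
        for g in candidates
    }
    # Rank by descending score; ties broken by length, then
    # alphabetically (hypothetical tie-break, for determinism).
    ranked = sorted(candidates, key=lambda g: (-scores[g], len(g), g))
    # Taking the highest scores first yields the smallest set
    # whose scores reach the target total.
    selected, total = [], 0
    for g in ranked:
        if total >= len(matches) or scores[g] <= 0:
            break
        selected.append(g)
        total += scores[g]
    return selected


print(useful_ngrams(["abc", "abd"], ["xbc"]))
```

In the toy call, `ab` occurs in both matches and in no unmatch, so it alone reaches the target of 2 points.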
Problem in detail
● 16 instances, each with:
  ○ Set of matches M
  ○ Set of unmatches U
  ○ Weight w (a “difficulty” coefficient)
● Score of regex r on a given instance:
  ○ w * #misclassifications - length(r)
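A score of this kind can be computed in a few lines. The slide states the formula only in abbreviated form, so the sketch below relies on an assumption about the sign convention: each string of M matched by r earns w points, each string of U matched by r costs w points, and every character of r costs one point, so that higher is better. The function name `golf_score` and the value w = 10 in the usage line are illustrative only.

```python
import re


def golf_score(pattern, matches, unmatches, w):
    """Sketch of a Regex Golf style score (sign convention assumed):
    w * (matched strings in M - matched strings in U) - length(r)."""
    rx = re.compile(pattern)
    n_m = sum(1 for s in matches if rx.search(s))    # correct hits in M
    n_u = sum(1 for s in unmatches if rx.search(s))  # wrong hits in U
    return w * (n_m - n_u) - len(pattern)


# Toy usage with an illustrative weight w = 10.
print(golf_score("fo", ["foo", "food"], ["bar"], 10))
```

Under this convention a regex that classifies everything correctly scores w * |M| minus its length, so shortening a perfect regex by one character gains exactly one point.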
Problem in detail
Baseline
● GP-RegexExtract (our data extraction tool)
● “Norvig” (January 2014)
  ○ Deterministic algorithm widely discussed on the web
  ○ IMPORTANT
    ■ Not developed for this challenge
    ■ Designed for completely avoiding misclassifications
    ■ Comparison not fully fair... but the only algorithm we were aware of
Results: Warning
● The web site does not collect scores/rankings
● Programmers advertised solutions on forums
  ○ GitHub, Reddit, Hacker News
● Sometimes only scores without any evidence
● Sometimes slightly improving earlier results by other programmers
● No evidence of time spent
Great Results!
● 6th/8th worldwide
  ○ At the time
  ○ To the best of our knowledge
● Without any hint from other programmers
● Much better than the baseline
Execution time, Actual regexes
Summary
● Evolutionary computation has reached a level at which it may successfully compete with human programmers
● In scenarios explicitly designed to test their practical skills and creativity
● And it may do so without any starting hint or external help
Web application
A web-based prototype is publicly available at http://regex.inginf.units.it/golf