Playing Regex Golf with Genetic Programming
A. Bartoli, A. De Lorenzo, E. Medvet, F. Tarlao
University of Trieste, Italy

The problem (“Regex Golf”)
● Specific kind of code golf
● Writing the shortest regular expression which:
○ matches all the strings in a given list
○ does not match any strings in another list

Regex Golf - Naive solution
Positive examples
● The Phantom Menace
● Attack of the Clones
● Revenge of the Sith
● A New Hope
● The Empire Strikes Back
● Return of the Jedi

Negative examples
● The Wrath of Khan
● The Search for Spock
● The Voyage Home
● The Final Frontier
● The Undiscovered Country
● Generations
● First Contact
● Insurrection
● Nemesis

The Phantom Menace|Attack of the Clones|Revenge of the Sith|A New Hope|The Empire Strikes Back|Return of the Jedi
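The naive solution is simply the alternation of all positive examples. A minimal Python sketch of that construction follows; the function name `naive_regex` is hypothetical, and the sketch assumes the strings contain no regex metacharacters (true for these titles), so no escaping is applied.

```python
import re

POSITIVES = [
    "The Phantom Menace", "Attack of the Clones", "Revenge of the Sith",
    "A New Hope", "The Empire Strikes Back", "Return of the Jedi",
]
NEGATIVES = [
    "The Wrath of Khan", "The Search for Spock", "The Voyage Home",
    "The Final Frontier", "The Undiscovered Country", "Generations",
    "First Contact", "Insurrection", "Nemesis",
]

def naive_regex(matches):
    """Join every positive example with '|' (hypothetical helper).

    Assumes the strings contain no regex metacharacters.
    """
    return "|".join(matches)

naive = naive_regex(POSITIVES)
rx = re.compile(naive)

# Correct by construction, but as long as the positive list itself.
all_positives_match = all(rx.search(s) for s in POSITIVES)
no_negative_matches = not any(rx.search(s) for s in NEGATIVES)
print(len(naive), all_positives_match, no_negative_matches)
```

The result classifies every string correctly, but its length grows with the positive list, which is exactly what the game penalizes.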

Regex Golf - Best solution
Positive examples
● The Phantom Menace
● Attack of the Clones
● Revenge of the Sith
● A New Hope
● The Empire Strikes Back
● Return of the Jedi

Negative examples
● The Wrath of Khan
● The Search for Spock
● The Voyage Home
● The Final Frontier
● The Undiscovered Country
● Generations
● First Contact
● Insurrection
● Nemesis

m | [tN]|B
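The short regex on this slide, `m | [tN]|B` (the spaces are significant), can be checked against both lists with a few lines of Python. This is only a verification sketch; the helper name `misclassified` is hypothetical, and matching is assumed to be an unanchored, case-sensitive search.

```python
import re

MATCHES = [
    "The Phantom Menace", "Attack of the Clones", "Revenge of the Sith",
    "A New Hope", "The Empire Strikes Back", "Return of the Jedi",
]
UNMATCHES = [
    "The Wrath of Khan", "The Search for Spock", "The Voyage Home",
    "The Final Frontier", "The Undiscovered Country", "Generations",
    "First Contact", "Insurrection", "Nemesis",
]

def misclassified(pattern, matches, unmatches):
    """Return (positives the regex misses, negatives it wrongly matches)."""
    rx = re.compile(pattern)
    missed = [s for s in matches if rx.search(s) is None]
    wrongly_matched = [s for s in unmatches if rx.search(s) is not None]
    return missed, wrongly_matched

BEST = "m | [tN]|B"  # three alternatives: "m ", " t" or " N", "B"
missed, wrongly_matched = misclassified(BEST, MATCHES, UNMATCHES)
print(len(BEST), missed, wrongly_matched)
```

Both returned lists are empty, so the 10-character regex separates the two sets without errors.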

16 difficult instances

A lot of vibrant discussions...

...really a lot...

February 26th, 2014

Automatic Regex Golf

Our previous GP-based tool
Automatic regex generation from examples
For data extraction

IEEE Computer, GECCO Hot Off the Press
http://regex.inginf.units.it

Key observations
● We need a classifier, not an extractor
○ No need to identify boundaries
● No need to infer a general pattern
● No need to process streams with unknown items
○ We need to “overfit”

Our approach
● Candidate regex = tree
● Internal nodes: usual regex operators
○ No greedy/lazy quantifiers (execution time too long to be practical)
● Leaf nodes:
○ a-z, A-Z, ^, $, ..
○ problem-dependent elements (next slide)
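To illustrate the tree representation, here is a small Python sketch of a candidate regex stored as a tree and rendered to a string. The `Node` class and the operator labels `or`, `concat` and `class` are assumptions made for this example, not the encoding used by the actual tool.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                  # operator name, or literal text for a leaf
    children: list = field(default_factory=list)

    def to_regex(self) -> str:
        if not self.children:                   # leaf: character, range, anchor or n-gram
            return self.label
        parts = [c.to_regex() for c in self.children]
        if self.label == "or":
            return "|".join(parts)
        if self.label == "class":
            return "[" + "".join(parts) + "]"
        if self.label == "concat":
            # Alternation binds loosest, so group it when it is concatenated.
            return "".join(
                f"(?:{p})" if c.label == "or" and c.children else p
                for c, p in zip(self.children, parts)
            )
        raise ValueError(f"unknown operator: {self.label}")

# Tree for the short Star Wars regex: or(concat(m, ' '), concat(' ', [tN]), B)
tree = Node("or", [
    Node("concat", [Node("m"), Node(" ")]),
    Node("concat", [Node(" "), Node("class", [Node("t"), Node("N")])]),
    Node("B"),
])
print(tree.to_regex())
```

With this kind of encoding, genetic operators such as crossover and mutation can work on subtrees, and the string form is only produced when a candidate has to be evaluated.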

Problem-dependent elements
● All characters in matches
● All partial ranges including those characters
● All “most useful” n-grams (n=2,3,4)
○ Build all n-grams
○ Score each n-gram based on its frequency: +1 for each match, -1 for each unmatch
○ Rank n-grams
○ Select the smallest set totalling M points (M being the number of matches)
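The n-gram selection steps can be sketched in Python as follows. Several details are assumptions not stated on the slide: n-grams are built from the matches only, an n-gram is counted once per string that contains it, ties are broken alphabetically, and selection stops early if only non-positive scores remain. The function names are hypothetical.

```python
def ngrams(s, sizes=(2, 3, 4)):
    """All distinct substrings of s with the given lengths."""
    return {s[i:i + n] for n in sizes for i in range(len(s) - n + 1)}

def select_ngrams(matches, unmatches, sizes=(2, 3, 4)):
    """Pick top-ranked n-grams until their scores total len(matches)."""
    scores = {}
    for s in matches:                    # +1 for each match containing the n-gram
        for g in ngrams(s, sizes):
            scores[g] = scores.get(g, 0) + 1
    for s in unmatches:                  # -1 for each unmatch containing it
        for g in ngrams(s, sizes):
            if g in scores:
                scores[g] -= 1
    # Rank by descending score; alphabetical tie-break is an assumption.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    selected, total = [], 0
    for gram, score in ranked:
        if total >= len(matches) or score <= 0:
            break
        selected.append(gram)
        total += score
    return selected

# Toy instance: "ab" occurs in both matches and in no unmatch.
print(select_ngrams(["abc", "abd"], ["xbc"], sizes=(2,)))
```

Taking n-grams greedily from the top of the ranking is what makes the selected set the smallest one that reaches the target total.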

Evolutionary search
● 500 individuals, 1000 generations
● 32 independent searches
● Multiobjective fitness (NSGA-II)
● Minimize
○ Number of misclassifications
○ Length
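The two objectives can be illustrated with a short Python sketch of a fitness tuple and a Pareto-dominance test. It is not an NSGA-II implementation (there is no non-dominated sorting by rank and no crowding distance), and the function names are hypothetical; it only shows what "minimize misclassifications and length" means when two candidates are compared.

```python
import re

def fitness(pattern, matches, unmatches):
    """Two objectives, both to be minimized: (misclassifications, length)."""
    rx = re.compile(pattern)
    errors = sum(rx.search(s) is None for s in matches)
    errors += sum(rx.search(s) is not None for s in unmatches)
    return (errors, len(pattern))

def dominates(a, b):
    """a dominates b if it is no worse on every objective and better on one."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(population, matches, unmatches):
    """Candidates whose fitness is not dominated by any other candidate."""
    scored = [(p, fitness(p, matches, unmatches)) for p in population]
    return [p for p, f in scored if not any(dominates(g, f) for _, g in scored)]

# Toy instance: "foo|food" is as accurate as "foo" but longer, so it is dominated.
front = pareto_front(["foo", "foo|food", "z"], ["foo", "food"], ["bar"])
print(front)
```

In the toy run, `foo` (no errors, length 3) and `z` (two errors, length 1) both stay on the front, since neither is better on both objectives.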

Problem in detail
● 16 instances, each with:
○ Set of matches M
○ Set of unmatches U
○ Weight w (a “difficulty” coefficient)
● Score of regex r on a given instance:
○ w * (n_M - n_U) - length(r), where n_M and n_U are the numbers of strings in M and in U matched by r


Baseline
● GP-RegexExtract (our data extraction tool)
● “Norvig” (January 2014)
○ Deterministic algorithm widely discussed on the web
○ IMPORTANT
■ Not developed for this challenge
■ Designed for completely avoiding misclassifications
■ Comparison not fully fair...but the only algorithm we were aware of

Results: Warning
● The web site does not collect scores/rankings
● Programmers advertised solutions on forums
○ GitHub, Reddit, Hacker News
● Sometimes only scores without any evidence
● Sometimes slightly improving earlier results by other programmers
● No evidence of time spent

Great results!
● 6th/8th worldwide
○ At the time
○ To the best of our knowledge
● Without any hint from other programmers
● Much better than the baseline

Execution time, Actual regexes

Summary
● Evolutionary computation has reached a level at which it may successfully compete with human programmers
● In scenarios explicitly designed to test their practical skills and creativity
● And it may do so without any starting hint or external help

Web application
A web-based prototype is publicly available at http://regex.inginf.units.it/golf
