Custom languages in semantic version control

Tuesday, September 08, 2015

Following our path towards a semantic version control, we are totally aware of the great number of programming languages out there. Even if we natively support popular languages –such as C#, Java or C– we can't expect to cover all that ever-growing vastness. But we wouldn't like to let the Plastic SCM community think we've forgotten about them, either.

This is why we've allowed external language parsers to be plugged into Plastic since Plastic SCM release 5.4.16.689. Now developers can collaborate with the community to implement custom parsers, which means support for virtually any language! There are, however, some rules that all custom parsers must follow in order to communicate successfully with the Plastic SCM semantic version control engine.

Take a look at the outcome:

Wouldn't you love to have this kind of information right on your version control GUI? Keep reading, then!

How to implement your custom language parser

First of all, the parser binary must accept two command-line arguments:

customparser.exe shell <flag-file>

The first argument will always be shell, which tells your parser to wait indefinitely for input (shell mode).

The second argument contains the path to a flag file. The parser must create this file once it's ready to start parsing. This delays parsing requests until the parser has completed its initialization routines (if there are any).

At this point, the parser should be waiting for input lines. Plastic SCM will write pairs of lines to the parser's standard input. The first line of each pair contains the input path, pointing to the source code file retrieved from Plastic. The second line contains the output path, where the parser is expected to write the parsed results. Finally, the parser should exit when it receives a line containing just the keyword "end".

The parser's standard input will typically look like this:

C:\Users\Developer\AppData\Local\Temp\mysourcecode.lang
C:\Users\Developer\AppData\Local\Temp\parsed_mysourcecode.yaml
C:\Users\Developer\AppData\Local\Temp\infiniteloop.lang
C:\Users\Developer\AppData\Local\Temp\parsed_infiniteloop.yaml
end

After receiving each pair of lines, the parser must write "OK" to its standard output if the parse operation succeeded, or "KO" if it failed.
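The protocol above can be sketched as a small Python program. This is only an illustration under a few assumptions: parse_to_file is a hypothetical placeholder for your real parsing logic, the flag file is created empty (the post only says it must be created), and each answer is written on its own line and flushed right away.

```python
import json
import sys
from pathlib import Path


def parse_to_file(source_path, output_path):
    """Hypothetical placeholder for the real parser.

    Reads the source file and writes a minimal root node. JSON is used
    because, as the post notes, YAML is a superset of JSON.
    """
    Path(source_path).read_bytes()  # raises if the source cannot be read
    root = {
        "type": "file",
        "name": source_path,
        "footerSpan": [0, -1],
        "parsingErrorsDetected": False,
    }
    Path(output_path).write_text(json.dumps(root), encoding="utf-8")


def run_shell(flag_file, stdin=sys.stdin, stdout=sys.stdout):
    """Serve parse requests until "end" (or end of input) is received."""
    # Initialization would go here; creating the flag file signals readiness.
    Path(flag_file).touch()
    while True:
        source_line = stdin.readline()
        if not source_line or source_line.strip() == "end":
            break
        output_line = stdin.readline()
        if not output_line:
            break
        try:
            parse_to_file(source_line.strip(), output_line.strip())
            stdout.write("OK\n")
        except Exception:
            stdout.write("KO\n")
        stdout.flush()  # answer immediately, the caller is waiting


def main(argv):
    if len(argv) != 3 or argv[1] != "shell":
        print("usage: customparser shell <flag-file>", file=sys.stderr)
        return 1
    run_shell(argv[2])
    return 0


# Only act as a parser when command-line arguments are supplied.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Passing the input and output streams as parameters makes it easy to exercise the loop from a test without launching a separate process.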

Parser output format

The parsing results are expected to be written in YAML. As Míryam -our semantic expert engineer- explains in our forum, we chose YAML over other options (plain text, JSON, XML) for three main reasons:

  • There are YAML parsers available for practically every language.
  • YAML is a superset of JSON, so anyone who prefers JSON may write their results in that format and Plastic will still be able to understand it!
  • It is human-readable.

All YAML output files will be encoded using UTF-8 and they can contain 3 types of data structures: file, container and terminal node.

File

It is the root node, unique and required.

Fields:

  • type - file
  • name - path of the file
  • locationSpan - row and column where the file starts and ends (optional)
  • footerSpan - start and end chars where the footer of the file starts and ends
  • parsingErrorsDetected - Boolean, whether or not the file contains parsing errors
  • children - set of containers and/or terminal nodes inside the file. If there aren't any, this field shouldn't be specified.
  • parsingError - set of parsing errors (optional, see description below)

Container

Fields:

  • type - relevant, generic name of the container in the current programming language
  • name - actual name of the container
  • locationSpan - row and column where the container starts and ends (optional)
  • headerSpan - start and end chars where the header of the container starts and ends
  • footerSpan - start and end chars where the footer of the container starts and ends. Set this field to [0, -1] if the container has no footer.
  • children - set of containers and/or terminal nodes present inside the current container. If there aren’t any, this field shouldn’t be specified.

Terminal node

Fields:

  • type - relevant, generic name of the node in the current programming language
  • name - actual name of the node
  • locationSpan - row and column where the node starts and ends (optional)
  • span - start and end char where the node starts and ends

Parsing error

Fields:

  • location - row and column where the error was detected
  • message - error message
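As a rough illustration of the structures described above, the following Python sketch builds the nodes as plain dictionaries and serializes them as JSON, relying on the point made earlier that JSON output is accepted as YAML. The helper names (terminal, container, file_node, to_text) are made up for this example, and the field names are taken from the lists above.

```python
import json

NO_FOOTER = (0, -1)  # value prescribed above for a missing footer


def terminal(node_type, name, span, location_span=None):
    """Terminal node: type, name and char span, plus optional locationSpan."""
    node = {"type": node_type, "name": name, "span": list(span)}
    if location_span is not None:
        node["locationSpan"] = location_span
    return node


def container(node_type, name, header_span, footer_span=NO_FOOTER,
              children=(), location_span=None):
    """Container node; 'children' is omitted when there are none."""
    node = {
        "type": node_type,
        "name": name,
        "headerSpan": list(header_span),
        "footerSpan": list(footer_span),
    }
    if location_span is not None:
        node["locationSpan"] = location_span
    if children:
        node["children"] = list(children)
    return node


def file_node(path, children=(), footer_span=NO_FOOTER, errors=()):
    """Root node; errors is a sequence of ((row, column), message) pairs."""
    node = {
        "type": "file",
        "name": path,
        "footerSpan": list(footer_span),
        "parsingErrorsDetected": bool(errors),
    }
    if children:
        node["children"] = list(children)
    if errors:
        node["parsingError"] = [
            {"location": list(location), "message": message}
            for location, message in errors
        ]
    return node


def to_text(tree):
    """Serialize the tree as JSON text."""
    return json.dumps(tree, indent=2)


if __name__ == "__main__":
    tree = file_node("/path/to/file", children=[
        terminal("unit", "Unit1", (0, 12)),
        container("interface", "interface", (13, 25)),
    ])
    print(to_text(tree))
```

If you prefer block-style YAML like the sample in the next section, a YAML library for your language can emit the same dictionaries instead.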

Sample output file

For instance, if we parsed the following Delphi contents:

unit Unit1;
interface

type
  TTest = class(TObject)
    procedure Test;
  end;

implementation

{ TTest }

procedure TTest.Test;
begin
  //
end;

end.

We would obtain the following YAML result (using \r\n as the line separators):

 
---
type : file
name : /path/to/file
locationSpan : {start: [1,0], end: [19,4]}
footerSpan : [0, -1]
parsingErrorsDetected : false
children :

  - type : unit
    name : Unit1
    locationSpan : {start: [1,0], end: [1,13]}
    span : [0, 12]

  - type : interface
    name : interface
    locationSpan : {start: [2,0], end: [9,0]}
    headerSpan : [13, 25]
    footerSpan : [0, -1]
    children :

      - type : type
        name : type
        locationSpan : {start: [4,0], end: [9,0]}
        headerSpan : [26, 33]
        footerSpan : [0, -1]
        children :

          - type : class
            name : TTest
            locationSpan : {start: [6,0], end: [9,0]}
            headerSpan : [34, 59]
            footerSpan : [81, 88]
            children :

              - type : procedure declaration
                name : Test
                locationSpan : {start: [7,0], end: [7,21]}
                span : [60, 80]

  - type : implementation
    name : implementation
    locationSpan : {start: [9,0], end: [19,4]}
    headerSpan : [89, 106]
    footerSpan : [164, 169]
    children :

      - type : procedure 
        name : TTest.Test
        locationSpan : {start: [11,0], end: [18,0]}
        span : [107, 163]
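The span values in the sample are inclusive character offsets that count the line separators too, which is why the line ending matters. As a small worked example, the hypothetical helper below (line_span is not part of Plastic) converts a range of lines into such a span. The source string is an assumption: it holds only the first lines of the unit, with \r\n separators.

```python
def line_span(text, first_line, last_line):
    """Return the inclusive [start, end] char offsets covering the given
    1-based line range, line separators included."""
    lines = text.splitlines(keepends=True)
    start = sum(len(line) for line in lines[:first_line - 1])
    length = sum(len(line) for line in lines[first_line - 1:last_line])
    return [start, start + length - 1]


# First lines of the Delphi unit, using \r\n as the line separator.
source = "unit Unit1;\r\ninterface\r\n\r\ntype\r\n"

unit_span = line_span(source, 1, 1)         # "unit Unit1;" plus \r\n
interface_header = line_span(source, 2, 3)  # "interface" plus the blank line

print(unit_span, interface_header)
```

These two values match the span of the unit node and the headerSpan of the interface container in the sample. When reading a real file in Python, open it with newline="" so the \r\n separators are preserved and the offsets are not shifted.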

Connecting your parser to Plastic

Once everything is in place, you just need to tell Plastic where to find the parser executable and which file extensions should be matched.

To do so, edit or create a file called externalparsers.conf in your local configuration directory (C:\Users\<your-username>\AppData\Local\plastic4). This file contains file extensions and their associated parser executables. This is a valid example:

.pas=C:\Users\Developer\SemanticMergeDelphi\pas2yaml.exe
.js=C:\Program Files\JavaScriptMagicParser\bin\js2yaml.exe

Once Plastic is restarted, any controlled file with one of those extensions will be processed by the semantic engine. The related semantic information will be displayed in the diff views, just as it is for any controlled C#, Java or C file. You've unleashed the power of Plastic!
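Since a mistyped path simply means the parser never runs, it can help to sanity-check the configuration before restarting Plastic. The sketch below is a hypothetical helper, not a Plastic tool. It assumes the file contains only extension=path lines like the ones shown above; whether Plastic accepts comments or other syntax in this file is not covered here.

```python
from pathlib import Path


def read_external_parsers(conf_text):
    """Map each extension to its parser path from 'ext=path' lines."""
    mapping = {}
    for raw_line in conf_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # tolerate blank lines (an assumption of this sketch)
        extension, separator, parser_path = line.partition("=")
        extension = extension.strip()
        if not separator or not extension.startswith("."):
            raise ValueError(f"malformed entry: {raw_line!r}")
        mapping[extension] = parser_path.strip()
    return mapping


def missing_parsers(mapping):
    """Return the extensions whose parser executable does not exist."""
    return [ext for ext, path in mapping.items() if not Path(path).is_file()]


if __name__ == "__main__":
    # Location taken from the directory mentioned above.
    conf = Path.home() / "AppData" / "Local" / "plastic4" / "externalparsers.conf"
    if conf.is_file():
        parsers = read_external_parsers(conf.read_text(encoding="utf-8"))
        for ext in missing_parsers(parsers):
            print(f"{ext}: parser not found at {parsers[ext]}")
```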

Example: Delphi parser

To help you get familiar with this external parser system, we're going to guide you through a real-world scenario: using an actual Delphi language parser. It was developed by André Mussche and Jeroen Vandezande (great job, guys!) as a result of a forum thread: http://www.plasticscm.net/index.php?/topic/1857-delphi-parser-development/. The parser was initially developed for SemanticMerge, but since Plastic uses the same inner mechanism it's a perfect match for our external parser test. You can download the parser from GitHub: https://github.com/andremussche/SemanticMergeDelphi.

We'll assume the parser to be located at C:\Users\Miguel\SemanticMergeDelphi, so the parser executable path will be C:\Users\Miguel\SemanticMergeDelphi\pas2yaml.exe.

Now I'll edit the externalparsers.conf file (C:\Users\Miguel\AppData\Local\plastic4\externalparsers.conf) and add the following line:

.pas=C:\Users\Miguel\SemanticMergeDelphi\pas2yaml.exe

After that, open the Plastic GUI. We'll create a new workspace called semanticdelphi at C:\Users\Miguel\wkspaces\semanticdelphi, pointing to a new repository called semanticdelphi as well.

Clicking on the "OK" button will create the workspace and the Plastic GUI will open up afterwards. We'll use GitSync to retrieve the contents of a sample GitHub repository for our tests: https://github.com/fabriciocolombo/delphi-rest-client-api.

Once the replication is complete we'll go back to the workspace explorer and we'll update our workspace to download the source files.

Let's create some differences to test our new parser! I'll edit /src/HttpConnectionWinHttp.pas to reorder some procedure definitions and change their implementations. Let's have a look at the embedded diff view on the pending changes view:

Check that out! The mighty Plastic semantic engine read the output of the Delphi parser and detected which methods were renamed, which ones were moved and which ones had their contents modified. For instance, THttpConnectionWinHttp.Get() was moved up 3 positions and its body was modified to include a new comment. You can click on the downward arrow next to the "C" (changed) symbol at the method signature (or Control + click) to display just the differences in the method code:

But we're not going to stop here! Let's see what Plastic is capable of. We're going to move some methods to a new file and add it to version control. Plastic is designed to analyze a set of files and extract refactors across all of them. We published this feature a few days ago; you should definitely see it for yourself!

Back to our sample: I'm taking the HTTP method procedures out of /src/HttpConnectionWinHttp.pas and placing them into a new file, /src/http/HttpMethods.pas. Once the changes are saved to disk, we'll just add the new directory and file to version control. This is what the pending changes view looks like now:

As expected, the moved methods appear as removed. We'll check in these changes and open the new changeset's differences right after. Let's click on "Analyze refactors" and see what happens:

It worked! You can see how Plastic found out what we had done and properly arranged all methods moved across files.



Now it's your turn! Find a suitable parser for your language of choice and let Plastic understand your code! Remember to stop by our other blog posts about semantic version control and tracking refactored code across files.

And don't forget to have some semantic fun!

5 comments:

  1. Does this mean you're not going to finish the C++ parser ?

    1. Not really. I don't think the C++ parser can be implemented externally unless I'm missing a key simplification, which would be great to find.

      Parsing C++ requires accessing include files, even for a single source file. And the files to be accessed could be outside the working copy, or belong to previous versions, so some sort of interface to access the right file on src and dst will be required.

    2. Does the parsing (for c++) have to go so deep as to parse included files and understand every statement? Can't we start by getting the basic structure right - classes, structs, methods etc? Look for keywords and delimiters, seeing unknown (type)names as placeholders. I may be too naive here, and I realize that ill-formed preproc macros can easily break the parser. Templates are difficult too, I'm afraid...

    3. The problem is that in C++ you can't "start small". And customers don't like "starting small things" because every possible little detail you postpone to be implemented "later on" will be immediately requested and highlighted as a bug fix when not a show-stopper :-S

      Going back to the C++ specific stuff: we're using libclang to parse C++. Trying to implement your own parser, for C++, is ... well, crazy, because it is probably the hardest to parse.

      And then, the test we perform to approve a parser is as follows:
      * Parse about 2 million files.
      * Convert them to our intermediate structure (sort of an AST).
      * Rebuild the original file from the intermediate.
      * Check we don't break anything - if we do the parsing wouldn't be good enough for merging... cause coders, for some reason, don't want the merge tool to break their files XD.

      The C++ parser can't guarantee that unless you parse the entire declarations correctly...

      We'll see if we can do something for diffs, leaving merges out...

  2. Cool! I guess we would have to configure additional include paths for the parser, then. But I see the problem when you have to load includes from each of the involved changesets in a merge... Thanks for the comment, I was clearly too naive! :P