Splitting source code identifiers using Bidirectional LSTM Recurrent Neural Network

Vadim Markovtsev
   Waren Long
   Egor Bulychev
   Romain Keramitas
   Konstantin Slavnov
   Gabor Markowski

Programmers make rich use of natural language in the source code they write through identifiers and comments. Source code identifiers are selected from a pool of tokens which are strongly related to the meaning, naming conventions, and context. These tokens are often combined to produce more precise and obvious designations. Such multi-part identifiers account for 97% of all naming tokens in the Public Git Archive, the largest dataset of Git repositories to date. We introduce a bidirectional LSTM recurrent neural network to detect subtokens in source code identifiers. We trained that network on 41.7 million distinct splittable identifiers collected from 182,014 open source projects in the Public Git Archive, and show that it outperforms several other machine learning models. The proposed network can be used to improve the upstream models which are based on source code identifiers, and to improve the developer experience by allowing code to be written without switching the keyboard case.

Submitted to: ML4P 2018 © Markovtsev et al. This work is licensed under the Creative Commons Attribution License.

Vadim Markovtsev (vadim@sourced.tech)
Waren Long (waren@sourced.tech)
Egor Bulychev (egor@sourced.tech)
Romain Keramitas (romain@sourced.tech)
Konstantin Slavnov (konstantin@sourced.tech)
Gabor Markowski

source{d}, Madrid, Spain

1 Introduction

The descriptiveness of source code identifiers is critical for readability and maintainability [21]. This property is hard to ensure using single words exclusively. Therefore it is common practice to concatenate multiple words into a single identifier. Whitespace characters in identifiers are forbidden by most programming languages, so there are naming conventions [5] like CamelCase or snake_case which specify the concatenation rules. By applying simple heuristics, it is possible to reverse those rules and restore the original words from identifiers. For example, FooBar or foo_bar are trivially disassembled into foo and bar. However, if a compound identifier consists of only lowercase or only uppercase characters, splitting requires domain knowledge and cannot be easily performed.

According to our estimates, up to 10% of the identifiers are not splittable at all or not fully splittable by style-driven heuristics; among the rest, 11% contain further splittable parts after the heuristics are applied. This leads to bigger vocabulary sizes, worse performance, and reduced quality of upstream research in the areas of source code analysis and Machine Learning on Source Code (MLonCode). A deep learning-based parser, capable of learning to tokenize identifiers from many training examples, can enhance the quality of research in topics like identifier embeddings [27], deduplication [22], topic modeling [25], and naming suggestions [9].

The main contributions of this paper are:

  • We built the biggest dataset of source code identifiers to date: 47.0 million identifiers extracted from the Public Git Archive [26], which contains the 182,014 most popular GitHub repositories.

  • We are the first to apply a recurrent neural network to split “unsplittable” identifiers. We show that the character-level bidirectional LSTM type of recurrent neural network (RNN) performs better than the character-level bidirectional GRU, the character-level convolutional neural network, the gradient boosted decision tree, the statistical dynamic programming model, and the unsmoothed maximum likelihood character-level model.

2 Identifier extraction

This section describes the source code identifier extraction pipeline which was used to generate the training dataset from the Public Git Archive.

We processed each Git repository with the source{d} engine [7] to determine the main branch and its head revision, took the files from that revision, and identified the programming languages of those files. We extracted identifiers from files according to the identified language with Babelfish [3] and Pygments [6]. Babelfish is a self-hosted server for universal source code parsing; it converts code files into Universal Abstract Syntax Trees. We fell back to Pygments for those languages which are not yet supported by Babelfish. Pygments was developed to highlight source code and uses regular expressions, so it makes mistakes and introduces noise.

We obtained 60.6 million identifiers after removing duplicates. This number was reduced to 47.0 million after manual rule-based filtering of the noisy output from Pygments. We then split the identifiers according to the common naming conventions. For example, FooBarBaz becomes foo bar baz, and method_base turns into method base. The listing of the function which implements the heuristics is provided in appendix A. We kept only the identifiers which consisted of more than one part and obtained 43.6 million distinct subtoken sequences; some identifiers aliased to the same subtoken sequence. The distribution of identifier lengths had a long tail, as seen in Fig. 1, so we set the maximum identifier length to 40 characters. The length threshold further reduced the dataset to 41.7 million unique subtoken sequences. All the models we trained used as input the lowercase strings created by merging the subtoken sequences together, along with the corresponding indices of subtoken boundaries. Figure 2 depicts the head of the frequency distribution of the subtokens.
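The construction of the model input described above can be sketched as follows. This is a minimal illustration rather than the pipeline code: the function name encode_subtokens is hypothetical, the length limit is applied to the merged string for simplicity, and marking the first character of every subtoken after the first is an assumed labeling convention.

```python
def encode_subtokens(subtokens, max_length=40):
    """Merge a subtoken sequence into one lowercase string and mark boundaries.

    Returns (merged, split_indices, labels), or None when the merged string
    exceeds max_length. Hypothetical helper; the labeling convention
    (1 at the first character of each new subtoken) is an assumption.
    """
    merged = "".join(subtoken.lower() for subtoken in subtokens)
    if len(merged) > max_length:
        return None
    split_indices = []
    position = 0
    # Every subtoken except the last one ends at a boundary.
    for subtoken in subtokens[:-1]:
        position += len(subtoken)
        split_indices.append(position)
    labels = [0] * len(merged)
    for index in split_indices:
        labels[index] = 1
    return merged, split_indices, labels


merged, split_indices, labels = encode_subtokens(["foo", "bar", "baz"])
print(merged, split_indices, labels)
```

For the subtoken sequence foo bar baz this yields the string foobarbaz with boundaries at indices 3 and 6.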

The raw dataset of 47.0 million identifiers is available for download on GitHub [4]. Currently available datasets of source code identifiers contain fewer than a million entities and focus on particular programming languages, such as Java [12].

Figure 1: Distribution of identifier lengths

Figure 2: Distribution of most frequent subtokens

3 Baselines

We now describe the models to which we compare our character-level bidirectional LSTM recurrent neural network.

Maximum likelihood character-level model

Probabilistic maximum likelihood language models (ML LM) are typical for Natural Language Processing [28]. Given the sequence of characters representing an identifier, we evaluate for each character the probability that the subsequence is a prefix, and pick the prefix that maximizes that probability, assuming . We repeat this procedure from the character following the chosen prefix. In the case of prefixes for which we have no prior knowledge, we slide the root forward until a match is found. Similarly to n-gram models [17], our character-level LM makes the Markov assumption that the sequence of characters is a memoryless stochastic process, so we assert that . We estimate these conditional probabilities using maximum likelihood [23]. We trained two unsmoothed models independently, corresponding to the forward and backward reading directions. Finally, we combined them via logical conjunction and via logical disjunction. The tree depth was limited to 11 for technical reasons: greater depths require too much memory. The implementation was CharStatModel [24].

Dynamic programming

Inspired by the dynamic programming approach to splitting words [20], we implemented a similar solution based on word frequencies. Under the hypothesis that the words are independent of one another, we can model the probability of a sequence of words using frequencies computed on a corpus. We trained on the generic Wikipedia corpus and on the unique subtokens in our identifier dataset, assuming either the Zipf prior or the posterior. Our implementation was based on wordninja [10].
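The idea of frequency-based dynamic programming can be illustrated with a short sketch. It is not the wordninja implementation: the function name split_by_frequency, the toy counts, the unknown_cost penalty for unseen single characters, and the max_word_length bound are all assumptions made for the example. Each word costs the negative logarithm of its relative frequency, and the split with the lowest total cost wins.

```python
import math


def split_by_frequency(text, counts, max_word_length=20, unknown_cost=25.0):
    """Split text into the word sequence with the lowest total cost.

    Words are assumed independent, so the cost of a sequence is the sum of
    -log(relative frequency) of its words. Unseen single characters get a
    fixed penalty (an assumption) so that a solution always exists.
    """
    total = sum(counts.values())
    cost = {word: -math.log(count / total) for word, count in counts.items()}
    # best[i] = (minimal cost of splitting text[:i], start of the last word)
    best = [(0.0, 0)]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - max_word_length), end):
            word = text[start:end]
            if word in cost:
                word_cost = cost[word]
            elif end - start == 1:
                word_cost = unknown_cost
            else:
                continue
            candidates.append((best[start][0] + word_cost, start))
        best.append(min(candidates))
    # Walk back through the stored start positions to recover the words.
    words = []
    end = len(text)
    while end > 0:
        start = best[end][1]
        words.append(text[start:end])
        end = start
    return words[::-1]


toy_counts = {"name": 50, "hash": 20, "from": 40, "uid": 10}
print(split_by_frequency("namehashfromuid", toy_counts))
```

With a vocabulary this small, any substring outside it falls apart into penalized single characters, which mirrors the out-of-vocabulary weakness of the statistical approaches.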

The main limitation of the statistical approaches is their inability to predict out-of-vocabulary words, especially method and class names, which represent a substantial portion of the identifiers in the validation set. The only way to compensate for this drawback is to increase the length of the context on which we compute priors, which simultaneously worsens the data sparsity problem [9] and increases the time and memory requirements.

Gradient boosting on decision trees

We trained gradient boosting on decision trees (GBDT) using XGBoost [13]. The tree input was a 10-character window with “a”-aligned ASCII codes instead of one-hot encoding. We did not choose a larger window to avoid introducing noise, given that the bulk of our identifiers were shorter than the 40-character limit. The windows were centered at each split point, and we also generated 80% negative samples at random non-split positions. The maximum tree depth was 30 and the number of boosting trees was 50.
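The window features could be built roughly as in the sketch below. Several details are assumptions, since the text does not specify them: the helper names window_features and make_samples are hypothetical, positions outside the identifier are padded with -1, and "80% negative samples" is read here as 0.8 negatives per positive sample.

```python
import random


def window_features(identifier, position, window=10):
    """Return 'a'-aligned character codes around a candidate boundary.

    The window covers window // 2 characters on each side of the boundary
    that precedes identifier[position]. Padding with -1 is an assumption.
    """
    half = window // 2
    features = []
    for index in range(position - half, position + half):
        if 0 <= index < len(identifier):
            features.append(ord(identifier[index]) - ord("a"))
        else:
            features.append(-1)
    return features


def make_samples(identifier, split_indices, negative_ratio=0.8, rng=random):
    """Build (features, label) pairs: all split points plus sampled non-splits."""
    positives = [(window_features(identifier, p), 1) for p in split_indices]
    non_split = [p for p in range(1, len(identifier)) if p not in split_indices]
    count = min(len(non_split), round(negative_ratio * len(positives)))
    negatives = [(window_features(identifier, p), 0)
                 for p in rng.sample(non_split, count)]
    return positives + negatives


print(window_features("foobar", 3))
```

The resulting integer vectors can be passed to a tree-based classifier directly, because trees do not require one-hot encoded inputs.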

Character-level Convolutional Neural Network

We stacked 3 Inception layers [30], with 1-dimensional ReLU kernels spanning 2, 3, 4, 6, 8, 12 and 16 one-hot encoded characters, and 64 dimensionality-reducing ReLU kernels of size 1. Thus the output of each layer was shaped 40 by 64. The last layer was connected to a time-distributed dense layer with sigmoid activation and binary labels. We applied no regularization because the dataset was big enough, and we used the RMSProp optimizer [18].

4 Character-level bidirectional recurrent neural network

Character-level bidirectional recurrent neural networks (BiRNNs) [29] are a family of models that combine two recurrent networks moving through the characters of a sequence in opposite directions, starting from opposite ends. BiRNNs are an effective solution for sequence modeling, so we tried them for the splitting task. Given that the tokens may be long, we chose LSTM [16] over vanilla RNNs to overcome the vanishing gradient problem. We also compared LSTM to GRU [15], as GRUs were shown to achieve similar quality while being faster to train.

Figure 3: BiLSTM network with one layer running on foobar. The vertical dashed line indicates the separation point.

Fig. 3 demonstrates the architecture of a BiLSTM network for splitting identifiers. It processes the characters of each identifier in both directions. The schema contains a single recurrent layer for simplicity; the real network is built with two stacked layers. The second recurrent layer is connected to a time-distributed dense layer with binary outputs and sigmoid activation. An output of 1 means that the character is a split point, and an output of 0 means that it is not. Sigmoid activation was used instead of softmax because there can be more than one split point per identifier.
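Turning the per-character sigmoid outputs back into subtokens can be sketched as follows. The function name split_with_probabilities is hypothetical, and two conventions are assumed rather than taken from the paper: an output at index i refers to a boundary before character i, and outputs are thresholded at 0.5.

```python
def split_with_probabilities(identifier, probabilities, threshold=0.5):
    """Cut the identifier before every character whose output passes the threshold.

    probabilities[i] is the assumed per-character sigmoid output for
    identifier[i]; index 0 is ignored because nothing precedes it.
    """
    parts = []
    start = 0
    for index in range(1, len(identifier)):
        if probabilities[index] >= threshold:
            parts.append(identifier[start:index])
            start = index
    parts.append(identifier[start:])
    return parts


# Made-up outputs for "foobar" with a confident boundary before "b".
print(split_with_probabilities("foobar", [0.01, 0.02, 0.03, 0.97, 0.05, 0.02]))
```

Because every position is thresholded independently, any number of split points per identifier is possible, which is the reason given above for using sigmoid rather than softmax.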

We trained our BiLSTM network on two NVIDIA GTX 1080 GPUs using Keras [14] with a TensorFlow backend [8]. It took approximately 1.1 hours to complete 3.5 epochs. Table 1 lists the hyperparameters we chose using Hyperopt [11]. The training curves are shown in Figure 4.

Parameter             Value
RNN sequence length   40
Layer sizes           256, 256
Batch size            512
Epochs                10
Optimizer             Adam [19]
Learning rate         0.001

Table 1: Network training parameters

Figure 4: Training curves for the BiLSTM

5 Evaluation

We divided the dataset into 80% train and 20% validation and calculated precision, recall and F1 score for each of the models. Precision is defined as the ratio of correct splitting predictions to the total number of predictions, recall as the ratio of correct splitting predictions to the ground truth number of splits, and the F1 score is the harmonic mean of precision and recall. The results are shown in Fig. 5 and Table 2. The worst models are clearly the statistical ones; however, the conjunction of character-level ML LMs achieved the highest precision of all, 96.6%. The character-level CNN is close to the top; it evaluates quickly and can be chosen if run time is important. LSTM performed better than GRU and achieved the highest F1 score, with 95% precision and 96% recall.

Model                          Precision  Recall  F1
Char. ML LM (disjunction)      0.563      0.936   0.703
Char. ML LM (conjunction)      0.966      0.573   0.719
Stat. dyn. prog., Wiki         0.741      0.912   0.818
Stat. dyn. prog., Zipf         0.937      0.783   0.853
Stat. dyn. prog., posterior    0.931      0.892   0.911
GBDT                           0.931      0.924   0.928
Char. CNN                      0.922      0.938   0.930
Char. BiGRU                    0.945      0.955   0.949
Char. BiLSTM                   0.947      0.958   0.952

Table 2: Evaluation results

Figure 5: Model comparison, F1 isocurves are dashed
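The metrics defined in this section can be computed as in the sketch below. The function names f1_score and split_metrics are hypothetical, and representing each identifier by a set of split indices is an assumption about the data layout, not the evaluation code used for the table.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def split_metrics(predicted, expected):
    """Compute precision, recall and F1 over split points.

    predicted and expected are lists with one set of split indices per
    identifier, in the same order.
    """
    correct = sum(len(p & e) for p, e in zip(predicted, expected))
    n_predicted = sum(len(p) for p in predicted)
    n_expected = sum(len(e) for e in expected)
    precision = correct / n_predicted if n_predicted else 0.0
    recall = correct / n_expected if n_expected else 0.0
    return precision, recall, f1_score(precision, recall)


# Two identifiers: one spurious split in the first, one missed split in the second.
print(split_metrics([{3, 6}, {4}], [{3}, {4, 8}]))
# Reproduce the BiLSTM row of the table from its precision and recall.
print(round(f1_score(0.947, 0.958), 3))
```

Applying f1_score to the precision and recall of the BiLSTM row gives 0.952, which matches the third numeric column of the table.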

6 Applications

The presented identifier splitters reduce the number of unique subtokens by 50%. We ran the BiLSTM model on the subtoken sequences from the dataset and generated refined subtoken sequences. We then measured the new number of unique identifier parts, which was reduced from 2,919,170 to 1,462,293. Samples of identifiers split by heuristics and our model are listed in appendix B. Smaller vocabulary size leads to faster training of the upstream models such as identifier embeddings based on the structural co-occurrence scope as a context [2] or topic models of files and projects [25].

It is also possible to use our model to automatically split identifiers typed in a single case without separators. This simplifies and speeds up typing code, provided that the number of splitting errors is low enough. Depending on the naming style, the described algorithm may save “Shift” or “Shift + Underscore” keystrokes. The achieved quality metrics are good enough: our network makes an error with 50% probability after identifiers assuming that each identifier contains a single split point.

7 Conclusion

We created and published a dataset with 47.0 million distinct source code identifiers extracted from Public Git Archive, the largest one to date. We trained several machine learning models on that dataset and showed that the character-level bidirectional LSTM recurrent neural network (BiLSTM) performs best, reaching 95% precision and 96% recall on the validation set. To our knowledge, it is the first time RNNs were applied to the source code identifier split problem. BiLSTM significantly (by 2 times) reduces the core vocabulary size in upstream problems and is good enough to improve the speed at which people write code.


  • [2] blog.sourced.tech/post/id2vec/.
  • [3] github.com/bblfsh.
  • [4] github.com/src-d/datasets/Identifiers.
  • [5] Naming convention (programming). wikiwand.com/en/Naming_convention_(programming).
  • [6] Pygments - generic syntax highlighter. pygments.org/.
  • [7] src-d/engine. github.com/src-d/engine.
  • [8] Martín Abadi et al.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. tensorflow.org.
  • [9] Miltiadis Allamanis, Earl T. Barr, Christian Bird & Charles Sutton (2015): Suggesting Accurate Method and Class Names. In: FSE, ACM, pp. 38–49, doi:10.1145/2786805.2786849.
  • [10] Derek Anderson: keredson/wordninja. github.com/keredson/wordninja.
  • [11] James Bergstra, Dan Yamins & David D. Cox (2012): Making a Science of Model Search. CoRR. Available at https://arxiv.org/abs/1209.5111.
  • [12] Simon Butler, Michel Wermelinger, Yijun Yu & Helen Sharp (2013): INVocD: Identifier Name Vocabulary Dataset. In: MSR, IEEE, pp. 405–408, doi:10.1109/MSR.2013.6624056.
  • [13] Tianqi Chen & Carlos Guestrin (2016): XGBoost: A Scalable Tree Boosting System. In: KDD, ACM, pp. 785–794, doi:10.1145/2939672.2939785.
  • [14] François Chollet et al.: Keras. keras.io.
  • [15] Junyoung Chung, Çaglar Gülçehre, KyungHyun Cho & Yoshua Bengio (2014): Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling. CoRR, doi:10.1007/978-3-319-67220-5_3.
  • [16] Alex Graves (2008): Supervised sequence labelling with recurrent neural networks. Ph.D. thesis, Technical University Munich, doi:10.1007/978-3-642-24797-2.
  • [17] Abram Hindle, Earl T. Barr, Zhendong Su, Mark Gabel & Premkumar Devanbu (2012): On the Naturalness of Software. In: ICSE, IEEE, pp. 837–847, doi:10.1145/2902362.
  • [18] Geoffrey Hinton: RMSProp. www.cs.toronto.edu/~tijmen/csc321/slides/lecture_slides_lec6.pdf.
  • [19] Diederik P. Kingma & Jimmy Ba (2014): Adam: A Method for Stochastic Optimization. CoRR. Available at https://arxiv.org/abs/1412.6980.
  • [20] Philipp Koehn & Kevin Knight (2003): Empirical Methods for Compound Splitting. In: EACL, pp. 187–193, doi:10.3115/1067807.1067833.
  • [21] D. Lawrie, C. Morrell, H. Feild & D. Binkley (2006): What’s in a Name? A Study of Identifiers. In: ICPC, IEEE, pp. 3–12, doi:10.1109/ICPC.2006.51.
  • [22] Cristina V. Lopes, Petr Maj, Pedro Martins, Vaibhav Saini, Di Yang, Jakub Zitny, Hitesh Sajnani & Jan Vitek (2017): DéJà Vu: A Map of Code Duplicates on GitHub. Proc. ACM Program. Lang. 1(OOPSLA), pp. 84:1–84:28, doi:10.1145/3133908.
  • [23] Christopher D. Manning & Hinrich Schütze (1999): Foundations of Statistical Natural Language Processing. MIT Press, doi:10.1017/S1351324902212851.
  • [24] Vadim Markovtsev: vmarkovtsev/CharStatModel. github.com/vmarkovtsev/CharStatModel.
  • [25] Vadim Markovtsev & Eiso Kant (2017): Topic modeling of public repositories at scale using names in source code. CoRR. Available at https://arxiv.org/abs/1704.00135.
  • [26] Vadim Markovtsev & Waren Long (2018): Public Git Archive: a Big Code dataset for all. In: MSR, ACM, doi:10.1145/3196398.3196464.
  • [27] Trong Duc Nguyen, Anh Tuan Nguyen, Hung Dang Phan & Tien N. Nguyen (2017): Exploring API Embedding for API Usages and Applications. In: ICSE, IEEE, pp. 438–449, doi:10.1109/ICSE.2017.47.
  • [28] Tung Thanh Nguyen, Anh Tuan Nguyen, Hoan Anh Nguyen & Tien N. Nguyen (2013): A Statistical Semantic Language Model for Source Code. In: FSE, ACM, pp. 532–542, doi:10.1145/2491411.2491458.
  • [29] M. Schuster & K.K. Paliwal (1997): Bidirectional Recurrent Neural Networks. Trans. Sig. Proc. 45(11), pp. 2673–2681, doi:10.1109/78.650093.
  • [30] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke & Andrew Rabinovich (2015): Going Deeper with Convolutions. In: CVPR, pp. 1–9, doi:10.1109/CVPR.2015.7298594.

Appendix A Identifier splitting algorithm, Python 3.4+

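import re

# Any run of non-letter characters (underscores, digits, punctuation) separates parts.
NAME_BREAKUP_RE = re.compile(r"[^a-zA-Z]+")
# Parts shorter than this are not emitted alone; they are glued to the next part.
min_split_length = 3

def split(token):
  token = token.strip()
  # One-element list so that the nested generator can modify the value.
  prev_p = [""]

  def ret(name):
    r = name.lower()
    if len(name) >= min_split_length:
      yield r
      if prev_p[0]:
        # Also emit the preceding short part merged with this one.
        yield prev_p[0] + r
        prev_p[0] = ""
    else:
      # Too short to stand alone: remember it for the next part.
      prev_p[0] = r

  for part in NAME_BREAKUP_RE.split(token):
    if not part:
      continue
    prev = part[0]
    pos = 0
    for i in range(1, len(part)):
      this = part[i]
      if prev.islower() and this.isupper():
        # Lower-to-upper transition, e.g. fooBar: cut before the uppercase letter.
        yield from ret(part[pos:i])
        pos = i
      elif prev.isupper() and this.islower():
        # Upper-to-lower transition, e.g. the "Pa" in XMLParser.
        if 0 < i - 1 - pos <= min_split_length:
          # Short uppercase run: its last letter starts the next part.
          yield from ret(part[pos:i - 1])
          pos = i - 1
        elif i - 1 > pos:
          # Longer uppercase run: emitted together with its last letter.
          yield from ret(part[pos:i])
          pos = i
      prev = this
    last = part[pos:]
    if last:
      yield from ret(last)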

Appendix B Examples of identifiers from the dataset processed by heuristics and the BiLSTM model

Input identifier            Output TokenParser          Output BiLSTM
OMX_BUFFERFLAG_CODECCONFIG  omx bufferflag codecconfig  omx buffer flag codec config
metamodelength              metamodelength              meta mode length
rESETTOUCHCONTROLS          r esettouchcontrols         reset touch controls
ID_REQUESTRESPONSE          id requestresponse          id request response
%afterfor                   afterfor                    after for
simpleblogsearch            simpleblogsearch            simple blog search
namehash_from_uid           namehash from uid           name hash from uid
GPUSHADERDESC_GETCACHEID    gpushaderdesc getcacheid    gpu shader desc get cache id
oneditvaluesilence          oneditvaluesilence          on edit value silence
XGMAC_TX_SENDAPPGOODPKTS    xgmac tx sendappgoodpkts    xgmac tx send app good pkts
closenessthreshold          closenessthreshold          closeness threshold
test_writestartdocument     test writestartdocument     test write start document
dspacehash                  dspacehash                  d space hash
testfiledate                testfiledate                test file date
ASSOCSTR_SHELLEXTENSION     assocstr shellextension     assoc str shell extension