simple

GitHub is a platform that provides hosting for software development version control using Git. It features an application programming interface that allows software to interact with the platform. The enormous quantity of information hosted on GitHub may be useful for studying the current presence of development tools in the open-source software development community. However, the search engine has restrictions that make it impossible to issue complex queries to the platform. This report describes an object-oriented and extensible solution, named QuantityEr, to obtain the number of search results of complex queries to GitHub by using the inclusion-exclusion principle. The mathematical definitions, as well as related concepts, are presented. The mathematical model is discussed. The general design of the application and the development tools used are presented. Also, the results of the execution examples are shown. It is concluded that the treated problem has been solved, although more work may be done to improve the solution.


Introduction
GitHub is a platform that provides hosting for software development version control using Git. It provides several collaboration features such as bug tracking, feature requests, task management, and wikis for every project. It also features an application programming interface (API) that allows software to interact with the platform [1]. Through this API a search engine can be accessed. The search engine allows users to find almost every single aspect across several projects, source codes and other areas and features of the platform [2]. A web page that serves as an interface to the search API is also available.
As of August 2019, GitHub reports having over 40 million users and more than 100 million repositories. This enormous quantity of information may be useful, among other things, to obtain the number of projects, source codes, issues, etc., that mention a set of technologies, tools, development libraries, etc., in order to study the current presence of these tools in the open-source software development community. Other kinds of quantitative studies may be done as well [3]. Examples of those kinds of research are [4][5][6][7].
However, the search engine has some restrictions that make it impossible to issue complex queries to the platform. According to the GitHub Developer Guide, the restrictions are the following:
• The Search API does not support queries that are longer than 256 characters (not including operators or qualifiers) or that have more than five AND, OR, or NOT operators.
• Authenticated requests are limited to 30 requests per minute. For unauthenticated requests, the rate limit allows up to 10 requests per minute.
Furthermore, if the search is over source code files, special restrictions apply.
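The two query limits listed above can be checked before a query is sent. The following is a minimal sketch, not part of QuantityEr: the function names are hypothetical, only upper-case AND, OR and NOT are treated as operators, and qualifiers are not stripped before measuring the length, so the length check is only an approximation.

```python
import re

MAX_QUERY_CHARS = 256  # length limit quoted above (operators excluded)
MAX_OPERATORS = 5      # limit on AND, OR and NOT operators quoted above

# Assumption: operators are written in upper case as whole words.
_OPERATOR = re.compile(r"\b(?:AND|OR|NOT)\b")


def count_operators(query: str) -> int:
    """Count the AND, OR and NOT operators in a query string."""
    return len(_OPERATOR.findall(query))


def fits_search_limits(query: str) -> bool:
    """Approximate check of the two query limits (qualifiers are not stripped)."""
    without_ops = _OPERATOR.sub(" ", query)
    text_length = len(" ".join(without_ops.split()))
    return (count_operators(query) <= MAX_OPERATORS
            and text_length <= MAX_QUERY_CHARS)


print(count_operators("A OR (C AND (D OR E))"))
print(fits_search_limits("A OR (C AND (D OR E))"))
```

A query that fails this check is a candidate for the decomposition described later in this report.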
A system named GHTorrent has already been developed to ease the interaction with the large quantity of

Facultad de Ingeniería
Universidad La Salle, Arequipa, Perú facin.innosoft@ulasalle.edu.pe

example, that search on source code is not allowed. Furthermore, the objective of the system is to interact with GitHub, which means that a future interaction with other platforms is not currently conceived.
A different kind of alternative is GH Archive, which records events from GitHub. The recorded data can be accessed through BigQuery, which allows any kind of SQL-like query. GH Archive, although a powerful and flexible solution, does not constitute an alternative to explore the data stored in GitHub but a tool to explore the data that represents the interaction with GitHub. This means that, for example, searching inside public source code cannot be done with GH Archive.
Moreover, both of these systems are server-like development tools and not client applications ready to use for making queries.
In the context of this article, complex queries are those that have many logical connectives and sub-expressions, for example A OR (C AND (D OR E)), especially those that exceed the allowed number of logical operators.
By getting the number of results of queries of this kind, analyses of the current presence of technologies might be done. Although many reporting tools have been developed, none of them is capable of getting the number of results of complex queries issued directly to GitHub. Some of these tools are listed at https://www.gharchive.org/.
Another example, not listed at the previous URL, is https://www.programcreek.com/. In that case the reports are just for statically selected libraries from statically selected languages.
This report describes a simple solution, named QuantityEr, to obtain the number of search results of complex queries issued directly to GitHub. The proposed design was conceived with extension in mind, in such a way that it would be possible to incorporate the ability to interact with other similar platforms besides GitHub, as well as other query languages and algorithms for obtaining the number of search results.
The current document is structured as follows. The Mathematical background section presents the mathematical definitions and concepts necessary to understand the proposed solution. The Results and discussion section describes the proposed solution as well as some usage examples. The Conclusions section makes the final remarks.

Mathematical background
In order to understand the proposed solution, some mathematical background is necessary. To achieve a self-contained report, this section presents the principal mathematical concepts used in the design of the solution. The following definitions (or equivalent ones), as well as other complementary concepts and proofs, can be found in the cited references [9][10][11][12][13][14][15][16][17].
The following notations will be used in this report.
• ℘(A) denotes the power set of a set A, that is the set of all subsets of A.
• |A| denotes the cardinality of a set A, that is the number of elements in A.
• ∅ denotes the empty set.

Boolean algebras
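The first essential concept for the design of the proposed solution is that of Boolean algebra.

Definition 1. A Boolean algebra is a tuple $(S, +, \cdot, \overline{\phantom{x}}, \bot, \top)$ where $S$ is a set containing distinct elements $\bot$ and $\top$, $+$ and $\cdot$ are binary operators on $S$, and $\overline{\phantom{x}}$ is a unary operator on $S$. Every Boolean algebra satisfies the following laws for all $x, y, z \in S$.

Commutative laws:
$$x + y = y + x, \qquad x \cdot y = y \cdot x$$
Distributive laws:
$$x + (y \cdot z) = (x + y) \cdot (x + z), \qquad x \cdot (y + z) = (x \cdot y) + (x \cdot z)$$
Identity laws:
$$x + \bot = x, \qquad x \cdot \top = x$$
Complement laws:
$$x + \overline{x} = \top, \qquad x \cdot \overline{x} = \bot$$

Associative and idempotent laws, as well as other laws, can also be considered since they follow from the laws in the definition. Furthermore, other useful operators can be derived from the previous ones [12,14,16].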
Fact 1. In a Boolean algebra $(S, +, \cdot, \overline{\phantom{x}}, \bot, \top)$ the following laws are satisfied for all $x, y, z \in S$.

Associative laws:
$$x + (y + z) = (x + y) + z, \qquad x \cdot (y \cdot z) = (x \cdot y) \cdot z$$
Idempotent laws:
$$x + x = x, \qquad x \cdot x = x$$

Boolean algebras are used to model operations over the elements of a set that map two elements to the maximum ($+$ operation) or the minimum ($\cdot$ operation) of both in a partial order whose minimum and maximum are $\bot$ and $\top$, respectively. In other words, a partial order $\le$ can be defined over $S$ where $x \le y$ if and only if $x + y = y$. Also, intuitively speaking, every element has an associated complement such that both together form the maximum and have only the minimum in common, as stated in the complement laws.
This is the most elementary Boolean algebra and is the one found in classical binary logic, which has applications in several areas of computer science [10][12][13][14].
In this specific work, the last two described Boolean algebras are crucial because the current problem is to find the number of objects that make a logical sentence true. In this context, the logical sentence is the query to be issued to the platform. The proposed solution takes advantage of the equivalences between classical logic and set theory in the context of Boolean algebras to solve this problem.
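The correspondence between classical logic and set theory can be checked by brute force on small examples. The sketch below is only an illustration under stated assumptions: the function `satisfies_axioms` is a hypothetical name, and the laws tested are the commutative, distributive, identity and complement laws usually taken as the definition of a Boolean algebra. It tests both the two-valued algebra on {0, 1} and the algebra of subsets of a three-element set.

```python
from itertools import combinations, product


def satisfies_axioms(S, plus, times, comp, bottom, top):
    """Brute-force check of the defining laws over every element of S."""
    for x, y, z in product(S, repeat=3):
        # Commutative laws.
        if plus(x, y) != plus(y, x) or times(x, y) != times(y, x):
            return False
        # Distributive laws.
        if plus(x, times(y, z)) != times(plus(x, y), plus(x, z)):
            return False
        if times(x, plus(y, z)) != plus(times(x, y), times(x, z)):
            return False
    for x in S:
        # Identity laws.
        if plus(x, bottom) != x or times(x, top) != x:
            return False
        # Complement laws.
        if plus(x, comp(x)) != top or times(x, comp(x)) != bottom:
            return False
    return True


# Two-valued logic: OR, AND, NOT over {0, 1}.
logic_ok = satisfies_axioms(
    [0, 1], lambda x, y: x | y, lambda x, y: x & y, lambda x: 1 - x, 0, 1)

# Subsets of U: union, intersection, complement relative to U.
U = frozenset({"a", "b", "c"})
subsets = [frozenset(c) for r in range(len(U) + 1)
           for c in combinations(sorted(U), r)]
sets_ok = satisfies_axioms(
    subsets, lambda x, y: x | y, lambda x, y: x & y, lambda x: U - x,
    frozenset(), U)

print(logic_ok, sets_ok)
```

Both checks use the same function, which is the point of the abstraction: a query (logic) and the set of objects that satisfy it (sets) obey the same laws.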

Boolean functions
In some contexts, the combinations of operations on the set {0, 1} are called Boolean functions. The following definitions relate to this subject. This concept has wide application in the design of logic gate circuits. In this topic, one of the main problems is the simplification of Boolean expressions [9,12,14,16].
In the case of this work, these are of great importance because, as we will see, each query has an associated Boolean expression.The objective is to simplify it in order to obtain an expression that involves less computation.
The simplification of a Boolean expression may be done symbolically by applying the laws of a Boolean algebra (definition 1), but also by applying specific methods that simplify an equivalent form of the expression. One such method, the Quine-McCluskey algorithm, in essence tests combinations of the minterms in order to find those that are essential to represent the value of the expression. It is known that it does not perform well when the size of the input, in this case the expression to simplify, is big. In fact, the problem of simplification of Boolean expressions is considered NP-hard [12,14,16].
However, the simplification of a Boolean expression is still of great importance to this work, because small queries are preferable to big ones.
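Whether a rewritten expression is really equivalent to the original one can be checked on the full truth table. The sketch below is an illustration only: it takes the example query A OR (C AND (D OR E)) used earlier, a hand-written disjunctive form of it, and a hypothetical helper `equivalent`. It does not perform any minimization, and the exhaustive check is itself exponential in the number of variables.

```python
from itertools import product


def original(a, c, d, e):
    # A OR (C AND (D OR E))
    return a or (c and (d or e))


def disjunctive(a, c, d, e):
    # A OR (C AND D) OR (C AND E), obtained with the distributive law
    return a or (c and d) or (c and e)


def equivalent(f, g, n):
    """Compare two Boolean functions of n variables on every input."""
    return all(bool(f(*values)) == bool(g(*values))
               for values in product([False, True], repeat=n))


print(equivalent(original, disjunctive, 4))
```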
It is clear that a predicate has an associated Boolean expression if each atom is replaced by a Boolean variable.

Definition 6. The expression $S = \{x \mid P(x)\}$ is equivalent to $x \in S \iff P(x)$ [11].
The following theorem will be useful in the modeling of the solution.
Theorem 1. The following relations are satisfied for any sets $A = \{x \mid P(x)\}$ and $B = \{x \mid Q(x)\}$:
$$A \cup B = \{x \mid P(x) \vee Q(x)\}, \qquad A \cap B = \{x \mid P(x) \wedge Q(x)\}, \qquad \overline{A} = \{x \mid \neg P(x)\}$$
Proof. It follows directly from fact 3 and definition 6.
These relations may be easily understood: if A contains all the elements x such that P(x) = 1 and B contains all the elements x such that Q(x) = 1, then it follows, from definition 6 and the definition of union in fact 3, that A ∪ B contains the elements x such that P(x) ∨ Q(x) = 1. The same analysis can be done for the intersection and complement cases.
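The relation between predicates and sets can be illustrated with set comprehensions. In this small sketch the universe and the predicates `P` and `Q` are arbitrary choices made for the example, not taken from the report.

```python
# Illustrative universe and predicates (assumptions for the example).
U = set(range(1, 21))


def P(x):
    return x % 2 == 0  # x is even


def Q(x):
    return x % 3 == 0  # x is a multiple of three


A = {x for x in U if P(x)}
B = {x for x in U if Q(x)}

union_ok = (A | B) == {x for x in U if P(x) or Q(x)}
intersection_ok = (A & B) == {x for x in U if P(x) and Q(x)}
complement_ok = (U - A) == {x for x in U if not P(x)}

print(union_ok, intersection_ok, complement_ok)
```

In the setting of this report, `P` and `Q` play the role of sub-queries and `A` and `B` the role of the result sets that the search engine would return for them.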

Inclusion-exclusion principle
First, consider the cardinality of the power set. This will be useful later in the description of the proposed solution.

The inclusion-exclusion principle (IEP) is a mathematical formula that can be used to obtain the cardinality of the union of finite sets taking into account the cardinality of all possible intersections of the given sets.
Fact 5 (Inclusion-exclusion principle). The cardinality of the union of sets $S_{i \in \{1,2,\ldots,n\}}$ is
$$\left|\bigcup_{i=1}^{n} S_i\right| = \sum_{\emptyset \ne J \subseteq \{1,2,\ldots,n\}} (-1)^{|J|+1} \left|\bigcap_{j \in J} S_j\right|$$
The number of possible intersections of n sets is the same as the number of subsets of a set of n elements, not counting the empty set. Taking fact 4 into account, this leads to the following fact.
Fact 6. There are $2^n - 1$ terms in the inclusion-exclusion principle formula for n sets.
This means that an algorithm that calculates the cardinality of the union of n sets by directly using the IEP has exponential complexity [15,17].
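As a worked instance, for three sets the formula expands into seven terms, one for each non-empty subset of the index set {1, 2, 3}, which matches the count of $2^3 - 1$:

```latex
\[
\begin{aligned}
|S_1 \cup S_2 \cup S_3| ={} & |S_1| + |S_2| + |S_3| \\
 & - |S_1 \cap S_2| - |S_1 \cap S_3| - |S_2 \cap S_3| \\
 & + |S_1 \cap S_2 \cap S_3|
\end{aligned}
\]
```

Each additional set doubles the number of terms and adds one, which is the source of the exponential growth.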
In the proposed solution the IEP is used to decompose a given query into many smaller sub-queries that will be issued to the platform search API. The next section shows how to manage the problem of the exponential complexity when using this method.
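A direct use of the IEP over explicit sets can be sketched as follows. This is an illustration on in-memory sets, with the hypothetical function name `union_size_iep`; in the setting of this report each intersection cardinality would instead come from one sub-query to the search API.

```python
from itertools import combinations


def union_size_iep(sets):
    """Return (cardinality of the union, number of IEP terms evaluated)."""
    total = 0
    terms = 0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 == 1 else -1  # odd-sized groups add, even subtract
        for group in combinations(sets, k):
            common = set(group[0]).intersection(*group[1:])
            total += sign * len(common)
            terms += 1
    return total, terms


example = [{1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
print(union_size_iep(example))
```

The second value returned makes the exponential number of evaluated terms visible for any number of input sets.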

Results and discussion
The problem to solve is: how to get the number of results of complex queries to GitHub?
The proposed solution follows a divide and conquer approach as follows:
1. Simplify and decompose complex queries into smaller, simple sub-queries.

a query must be designed with care in order not to exceed the restriction that the GitHub Search API imposes on the number of operators.
After the sub-queries have been sent, the next step is to find the number of results of the main query by applying the IEP (fact 5). The problem with this approach is that, according to fact 6, the number of terms in the IEP formula with $n$ sets is $2^n - 1$, which is the number of sub-queries to be issued to the server.
However, each term in the IEP is an intersection. Moreover, the terms in the expression associated with the DNF are also intersections. Then, by applying fact 1, it is possible to reduce each term of the IEP formula, so that some terms might be repeated afterwards. For this reason, a cache is used to store already issued queries together with their respective result counts, in order to reduce the number of issued queries. However, work still needs to be done to accelerate the computation of the terms in the IEP formula.
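The effect of the cache can be sketched under simplifying assumptions: each DNF term is modeled as a frozenset of positive literals, the intersection of several terms is reduced to the union of their literals (which is where the idempotent, commutative and associative laws are used), and `count_conjunction` is a hypothetical stand-in for one request to the search API. None of these names come from QuantityEr itself.

```python
from itertools import combinations


def count_dnf(terms, count_conjunction):
    """Return (result count, total IEP terms, sub-queries actually issued)."""
    cache = {}
    total_terms = 0
    result = 0
    for k in range(1, len(terms) + 1):
        sign = 1 if k % 2 == 1 else -1
        for group in combinations(terms, k):
            total_terms += 1
            # Intersecting terms means requiring all their literals at once;
            # repeated literals collapse, so different groups may coincide.
            key = frozenset().union(*group)
            if key not in cache:
                cache[key] = count_conjunction(key)
            result += sign * cache[key]
    return result, total_terms, len(cache)


# Toy corpus standing in for the platform (an assumption for the example).
documents = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"a"}, {"a", "c"}]


def count_in_corpus(literals):
    return sum(1 for doc in documents if literals <= doc)


# (a AND b) OR (b AND c) OR (a AND c)
query = [frozenset("ab"), frozenset("bc"), frozenset("ac")]
print(count_dnf(query, count_in_corpus))
```

In this example seven IEP terms reduce to four distinct sub-queries, because every pair of terms and the triple all collapse to the same conjunction of a, b and c.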

Solution design
QuantityEr is designed using the object-oriented paradigm. Extensibility has been considered from the beginning by assigning a class to each sub-process in the solution. Figure 1 outlines the class diagram of the most important classes. The classes are given as abstract base classes, so they must be extended for a particular problem. Currently, the extensions for solving the problem in the specific case of GitHub are implemented. Each class is briefly described next.
Main: Coordinates the interaction between the objects of the Input, Engine and Output classes. That is, the main algorithm is implemented inside this class.
Input: Currently, the queries can be presented to QuantityEr from two sources: the command line and files.
Several queries can be presented to the application in one single execution. The responsibility of this class is to present these sources as a stream to the Parser. Since the logic of the input is encapsulated in one class, other kinds of input may be added in the future, for example inputs from the network.
Parser: Translates the queries presented as input to a standard language that can be managed by the other entities. Since the logic of parsing is encapsulated in one class, the syntax of the language used in the input queries does not need to be like the one expected by GitHub. This may ease the input by allowing a cleaner syntax.

Table 1 and figure 2 show that the number of sub-queries depends on the ability of the Python [18] SymPy [19] library to simplify the given expression. Also, in this case, the cache greatly reduces the number of issued queries, especially when the number of sub-queries is big.
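The decomposition into abstract base classes described in this section can be sketched as follows. The class names Main, Input, Parser, Engine and Output come from the text, but every method name and signature here is a hypothetical choice for illustration, and the toy subclasses only stand in for the GitHub-specific extensions.

```python
from abc import ABC, abstractmethod
from typing import Iterator


class Input(ABC):
    @abstractmethod
    def queries(self) -> Iterator[str]:
        """Yield the raw query strings, whatever their source."""


class Parser(ABC):
    @abstractmethod
    def parse(self, text: str) -> object:
        """Translate a raw query into the internal representation."""


class Engine(ABC):
    @abstractmethod
    def count(self, query: object) -> int:
        """Return the number of results for an internal query."""


class Output(ABC):
    @abstractmethod
    def report(self, text: str, amount: int) -> None:
        """Present one result to the user."""


class Main:
    """Coordinates the other objects; it knows nothing about GitHub."""

    def __init__(self, source, parser, engine, output):
        self.source, self.parser = source, parser
        self.engine, self.output = engine, output

    def run(self) -> None:
        for text in self.source.queries():
            amount = self.engine.count(self.parser.parse(text))
            self.output.report(text, amount)


# Toy extensions, used only to exercise the skeleton.
class ListInput(Input):
    def __init__(self, items):
        self.items = items

    def queries(self):
        return iter(self.items)


class WordParser(Parser):
    def parse(self, text):
        return text.split()


class WordCountEngine(Engine):
    def count(self, query):
        return len(query)


class ListOutput(Output):
    def __init__(self):
        self.rows = []

    def report(self, text, amount):
        self.rows.append((text, amount))


out = ListOutput()
Main(ListInput(["a b", "c"]), WordParser(), WordCountEngine(), out).run()
print(out.rows)
```

Supporting another platform or another input source would then mean adding one subclass, leaving the coordinating class untouched.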

Conclusions
In this report a tool named QuantityEr, which obtains the number of results of complex queries to the GitHub search API, has been described. The application uses the inclusion-exclusion principle and other mathematical abstractions to decompose the query into several simple sub-queries. The application uses a cache in order to reduce the number of sub-queries issued to the server. Even though the use of the cache is considered to improve the solution and make it viable, more work may be done in order to accelerate the computation of the IEP formula terms. Moreover, the application may be extended to resolve other restriction problems in GitHub and other platforms.

Definition 2 (Boolean function). A Boolean function of degree $n$ is a function $f : \{0, 1\}^n \to \{0, 1\}$ where $f$ is an atom (a single variable or value) or a composition of the operations ∧, ∨ and ¬ of the Boolean algebra $(\{0, 1\}, \vee, \wedge, \neg, 0, 1)$. This composition is called a Boolean expression, and the variables of the Boolean expression are called Boolean variables.

Figure 2. Number of sub-queries: total vs. issued.

Definition 4. A normal form of a Boolean expression $f(x_1, x_2, \ldots, x_n)$ is an equivalent Boolean expression of the form $g(x_1, x_2, \ldots, x_n) = t_1 * t_2 * \cdots * t_m$ where each $t_i$ ($1 \le i \le m$) is of the form $y_1 \circ y_2 \circ \cdots \circ y_k$ with $k \le n$, and each $y_j$ ($1 \le j \le k$) is of the form $x_l$ or $\neg x_l$ with $1 \le l \le n$. When $*$ is $\wedge$ and $\circ$ is $\vee$, the normal form is called conjunctive (CNF). Similarly, when $*$ is $\vee$ and $\circ$ is $\wedge$, the normal form is called disjunctive (DNF). Additionally, when the normal form is conjunctive, each $t_i$ is called a maxterm. Similarly, when the normal form is disjunctive, each $t_i$ is called a minterm.

The Quine-McCluskey algorithm is one such method that uses the normal form of a Boolean expression, specifically the DNF, to obtain an equivalent minimal expression. The algorithm, in essence, tests combinations of the minterms.

Vol. 1, No. 1, March-August 2020, ISSN: 2708-0935, pp. 11-25, https://revistas.ulasalle.edu.pe/innosoft

Fact 4. The cardinality of the power set of $U$ is $|\wp(U)| = 2^{|U|}$.

:
Represents the intermediate language that the other classes understand. All the queries inside the application are in this format.

In this case, the queries ask for the number of source files that use the classical synchronization mechanisms defined in the asyncio, multiprocessing and threading Python libraries. The results are summarized in table 1 and figure 2. The command-line options of the program, the actual output, the presented queries, as well as other execution examples, can be found in the attached document examples.html.