Bottlenecks for open source#

LAST UPDATE: June 9, 2023

Below are the arguments raised against using open source software or sharing code, grouped by major topic, together with counter arguments, best practices to mitigate these bottlenecks, and the relevant principles.

1. Quality#

Reasoning#

IT security

The IT department is reluctant to use open source software because of its unknown quality, potential security issues and the lack of direct support in case of a security breach.
The IT department claims that proprietary solutions sometimes provide better quality, better or longer support, and support in case of emergency. Collaborating with well-known brand names that have many years of experience in the field creates a feeling of safety and trust towards the software provider.
OSS tends to have release cycles that are too fast. Validating the security of all OSS modules and libraries takes time, and acquiring security testing or auditing for every release is not feasible when releases arrive every month or even more frequently (proper security testing requires about 2-3 months to complete).
Incidents in which OSS has been modified to leak data or to introduce trojans are becoming increasingly common. Intrusion detection, prevention and recovery systems are therefore even more important than they used to be. Unfortunately, OSS in that field seems to be missing, and what is available is usually very limited in its abilities.
Published code can accidentally contain hard-coded credentials or secrets.

Code quality

Exposing poor-quality code that does not meet industry standards could have a negative impact on the image of the institution.
The code is poorly documented, and no one has an appetite for writing the documentation that would help others reuse it.
The code/software has no tests (unit tests, regression tests, stress tests, …) in place. If code is published, the publisher has to invest in proper test scenarios.
Creating packages (CRAN, PyPI, Conda, APT…) to ease installation of the software requires additional effort.
The code is not general: it only fits the specific data structures and/or environment of one National Statistical Institute (NSI), for example through hard-coded configuration, and is difficult to adapt to the needs of other NSIs.

Counter arguments#

IT security

There is a common perception that proprietary software is more secure than open source software. Yet commercial software, like OSS, is licensed "as is", with end users bearing the risks of using it. Therefore, if an IT security officer considers that OSS should be validated, proprietary software should be validated too, using exactly the same procedures. Because its source code is available, OSS is generally easier to validate than closed source software. Moreover, proprietary software increasingly integrates telemetry functionality, which can be seen as a threat from a security perspective.

The risk of vulnerability varies from one OSS project to another, and depends on the means dedicated to the project, its governance structure and the existence of a code review process. One way to reduce the risk associated with OSS is to restrict use to code that has been thoroughly reviewed ([1],[2]). There are open source tools for automatic code review and vulnerability checking, such as those published by the OWASP project and SonarQube.

To tackle the challenge of maintenance and shorter release cycles for open source software, there are third-party commercial open source providers that offer the service of ensuring that software is secured and has long-term support.

Additionally, some commercial companies and public institutions run “bug bounty programs”, which are designed to encourage researchers to review code and report vulnerabilities in an organisation’s software or systems. In exchange for finding and reporting a vulnerability, the researcher is typically rewarded with a cash payment ([3],[4]). The existence of a bug bounty program does not guarantee the absence of security flaws, but it does show that attention is being paid to the quality of the software.

One option for handling security concerns is to use stable packages of OSS libraries and modules that have been verified against international standards. A possible future would include pre-built virtual machines, pre-configured for maximum security, for use in specific steps of the statistical process. This includes tools for data generation, anonymisation, pseudonymisation and other purposes that allow statistical work using grey-box (or black-box) methods. Another option is to create private-computation environments, which might require changes in national legislation and possibly monetisation schemes.

A relevant component of IT security is cryptography, which plays an important role in software security because it is used to protect sensitive data and communications. Proprietary and undocumented cryptographic algorithms are considered less trustworthy than open and well-documented ones, as it is difficult to verify the security and reliability of proprietary algorithms without access to their source code.

Open and well-documented cryptographic algorithms (like RSA, AES, …), on the other hand, are typically reviewed and analyzed by a wider community of experts and are subject to more rigorous testing and evaluation. This can help ensure that these algorithms are secure and reliable, and can make them more trustworthy for use in sensitive applications.

One way to ensure the reliability and security of cryptographic algorithms is to use algorithms that are documented in scientific journals and peer-reviewed by experts in the field. These algorithms have typically undergone extensive testing and evaluation, and their security and reliability have been demonstrated through rigorous analysis and experimentation.

Overall, it is important to choose cryptographic algorithms carefully and to consider their security, reliability, and documentation when selecting software. Open and well-documented algorithms can be a more trustworthy option for protecting sensitive data and communications.

Some open source implementations of open algorithms are certified by security agencies, e.g. https://www.ssi.gouv.fr/entreprise/certification_cspn/winpt-version-1-4-3-et-gnupg-version-1-4-10b/

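Finally, a safeguard for the quality and security of OSS is to check that the code being installed is the code that was reviewed or certified, by verifying its checksum or digital signature. MD5 sums are still commonly published, but MD5 is no longer considered collision-resistant, so a stronger hash such as SHA-256 or a signature is preferable where available.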
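The checksum verification mentioned above can be sketched as follows. This is a minimal illustration, not a prescribed procedure: it assumes the reviewed release publishes a SHA-256 digest (used here instead of MD5, which is no longer considered collision-resistant), and the function names are hypothetical.

```python
import hashlib
import hmac


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA-256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large archives do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_checksum(path, published_hex):
    """Check the file against the digest published for the reviewed release."""
    actual = sha256_of_file(path)
    # Normalise whitespace and case, which often differ in published checksum files.
    expected = published_hex.strip().lower()
    return hmac.compare_digest(actual, expected)
```

A checksum only shows that the file is identical to the one that was reviewed; it says nothing about who published it, which is what a digital signature adds.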

Code quality

Having common guidelines can improve code quality. Those guidelines should include:

Use environment variables for passwords
Use functions and procedures to avoid code duplication
Keep configuration files separate from functions and procedures
Exclude configuration files from code versioning with a .gitignore file
Provide clear and concise documentation
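The first guidelines in the list above could look like this in practice. The sketch is illustrative only: the JSON format, the `load_settings` function and the `DB_PASSWORD` variable name are assumptions, not part of any ESS guideline.

```python
import json
import os


def load_settings(config_path, environ=os.environ):
    """Combine non-secret settings from a file with a password from the environment."""
    # Non-secret settings live in a configuration file that is kept apart
    # from the code (and can be listed in .gitignore).
    with open(config_path, encoding="utf-8") as handle:
        settings = json.load(handle)

    # The password is never written to the file or the code; it is read from
    # an environment variable (hypothetical name DB_PASSWORD).
    password = environ.get("DB_PASSWORD")
    if password is None:
        raise RuntimeError("DB_PASSWORD is not set in the environment")

    settings["password"] = password
    return settings
```

Because neither the configuration file nor the password is part of the repository, the same code can be published and reused by another NSI with its own settings.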

Having code reviewed by other NSIs or experts in the field before publication can be an effective way to ensure that the code is of high quality and to minimize the risk of harm to the NSI’s reputation. Code review can involve having other developers or experts review the code to identify any issues or concerns, as well as running automated tests or using other tools to ensure that the code meets certain standards. Automated tests and code review can also reduce the risk of disclosing sensitive information through code. In addition, engaging with the open source community to seek feedback and support can improve the quality of the review.
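As an illustration of how an automated check can reduce the risk of disclosing credentials, the sketch below scans text for lines that look like hard-coded secrets. The two patterns are deliberately simple assumptions chosen for the example; dedicated secret scanners use much larger rule sets, and a match is only a hint for a human reviewer.

```python
import re

# Illustrative patterns only: a keyword assigned a quoted literal, and a PEM private key header.
SECRET_PATTERNS = [
    re.compile(r"""(?i)(password|passwd|secret|api_key|token)\s*[:=]\s*["'][^"']+["']"""),
    re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
]


def find_possible_secrets(text):
    """Return (line_number, line) pairs for lines that match a secret pattern."""
    findings = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        if any(pattern.search(line) for pattern in SECRET_PATTERNS):
            findings.append((line_number, line.strip()))
    return findings
```

Such a check could be run before each publication, so that a flagged line is reviewed before the code leaves the institution.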

One way to implement peer review of open-source code can be to assign “stars” to repositories based on some criteria. For example the code can get a star:

if the license is clear and permissive
if the code is documented
if the code is packaged
if the project implements automatic tests like unit tests, regression tests, stress tests
…
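A star rating of this kind could be computed as in the sketch below. It covers only the four criteria named above, with one star each; the class and field names are hypothetical, and how each criterion is assessed is left to the reviewers.

```python
from dataclasses import dataclass


@dataclass
class RepositoryCheck:
    """Outcome of a peer review of one repository (hypothetical structure)."""
    has_clear_permissive_license: bool
    is_documented: bool
    is_packaged: bool
    has_automated_tests: bool


def count_stars(check):
    """Award one star per criterion that the repository satisfies."""
    criteria = [
        check.has_clear_permissive_license,
        check.is_documented,
        check.is_packaged,
        check.has_automated_tests,
    ]
    return sum(criteria)
```

For instance, a repository that is licensed, documented and tested but not packaged would receive three stars under this scheme.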

Documentation

Writing documentation is often seen as a tedious and time-consuming task, and it is not uncommon for developers to prioritize writing code over writing documentation. This can be especially true in open-source projects, where developers may be volunteering their time and may not have the same incentives or pressures to write documentation as they would in a professional setting.

Developers can be encouraged to write documentation by giving them a documentation generator, guidelines and templates, and by offering training or support to help them get started. It also helps to recognize and reward contributions to documentation, as this motivates developers to put in the effort to write high-quality documentation.

Documentation generators promoted by the ESS could be, for example:

Doxygen: Doxygen is a popular documentation generator that can be used to generate documentation for code written in multiple programming languages, including C++, C, Python, Java, and others. It can generate documentation in a variety of formats, including HTML, LaTeX, and PDF.
Sphinx: Sphinx is a documentation generator that is primarily used for generating documentation for Python projects, but it can also be used to generate documentation for other programming languages. It supports the inclusion of documentation written in a variety of formats, such as reStructuredText and Markdown, and can generate documentation in HTML, LaTeX, and PDF formats.
Templates could take the form of, for example, Doxygen CSS or Sphinx themes.
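To show what a documentation generator works from, here is a small function whose docstring uses reStructuredText field lists, a format that Sphinx's autodoc extension can turn into reference pages. The function itself is a hypothetical example chosen for illustration.

```python
def weighted_mean(values, weights):
    """Return the weighted mean of ``values``.

    :param values: Observed values.
    :param weights: Non-negative weights, one per value.
    :returns: The sum of ``value * weight`` divided by the sum of the weights.
    :raises ValueError: If the lengths differ or the weights sum to zero.
    """
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total_weight
```

Keeping the description next to the code in this way means the documentation is updated in the same change as the function it describes.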

Packaging

The “re-usable generic functional building blocks” in Python, R, … are easier to reuse if they are packaged and distributed through the relevant platform (CRAN, PyPI, Conda…). Central distribution helps to ensure that the code is easy to install and use.

Packaging code can be a complex process that involves many different steps, such as defining dependencies, creating a package structure, and building and distributing the package. It can also require familiarity with specific tools and technologies, such as package managers and version control systems.

There are many resources available to help organizations learn how to package their code, including online tutorials and documentation, as well as tools and libraries that can assist with the process.

Relevant principle(s)#

2. Institutional environment#

Reasoning#

Skill gap

Publishing open source code can be a challenging task for an NSI that does not have the necessary skills. Generally, statisticians have neither the skills to publish code nor the expertise to create software packages.

Existing proprietary solutions

Legacy systems developed with commercial products already exist, and there is reluctance to develop and maintain a second, parallel system.
Fears that opening up methodologies and tools that have been applied for many years may expose latent problems that will ultimately harm the NSI’s reputation.

Organisational inertia

Time and effort needed to replace the existing technology and to find out which open source tools suit the NSI’s needs.
Difficulty in changing the way of thinking and working of employees with 20 or 30 years of professional experience. Difficulty in embracing new ideas and technologies that affect them, and a lack of new employees to take part in the transformation.
Lack of motivation for change, as it imposes an extra workload on employees, who must learn new programming languages, how to use the new software, etc.

Counter arguments#

In addition to incremental innovation, the ESS is called upon today to take bigger innovation leaps in certain areas. The ESS must develop entirely new statistical products and methodologies based on non-traditional data sources (including but not limited to so-called “big data”) and make use of digital tools that fall outside the traditional toolset of official statistics (e.g. Machine Learning). In doing so, NSIs will have to:

Rely on an increasing share of young staff members, i.e. fresh recruits from universities and from a more diversified range of educational backgrounds, namely “data science”, computer science and STEM (Science, Technology, Engineering and Mathematics) disciplines in general;
Engage with a broader set of scientific communities, beyond classical statisticians, including computer science and other STEM communities;
Engage with start-ups and SMEs to procure innovative technology solutions for the provision of tech-based products and services.

As a matter of fact, OSS has been standard practice in most research and innovation communities for more than a decade. Young STEM graduates are also increasingly trained to work with OSS tools and to adopt, develop, release and share OSS. In other words, they are already “OSS natives”, trained in an OSS-oriented culture where code sharing and openness are considered strong positive values. They are not only able to work in an OSS-oriented environment, they expect to do so in their careers. This is true for the graduates recruited by NSIs, who will become the producers of future statistics, and for those who will work in other organisations on the side of statistical users (universities, government agencies, etc.).

In the future, on the production side, it will be increasingly difficult for NSIs to recruit talented young staff members who accept working in closed, proprietary, non-OSS environments. On the user side, it will be increasingly difficult to assert the legitimacy of official statistics based on closed code that is not openly accessible, in a world where releasing OSS is the norm.

When developing new products, methodologies and processes, “going open source” improves the ability of NSIs to collaborate with external researchers, to reuse their research results, and to recruit OSS natives from universities. When a new methodology or process is developed as OSS from the start, benign methodological mistakes and points for improvement are detected and fixed in the development, pre-production or early production stages, without the reputational risks that may exist for legacy methodologies and processes that move to OSS at a mature stage. The (potential) gain in methodological quality therefore comes without reputational risk if the process is iterative (agile) and open. On the contrary, there is a reputational gain in making methods and code openly accessible right away, in line with the standard practice of scientific communities, which value reproducibility and auditability[5]. It should also be noted that other international institutions are increasingly adopting the same practice.

Training and capacity building: many open and freely available courses on the internet help develop the skills and expertise needed to publish open source code. There are also ESTP courses in this field.

Collaboration: existing ESSnets already help NSIs to work with more experienced peers on open source code.

Use of open source tools and platforms: NSIs can also make use of tools and platforms (like GitLab, GitHub, …) that streamline the process of publishing open source code. These tools support or automate tasks such as version control, code review, and documentation, which makes it easier for NSIs to publish open source code.

Support from the open source community: one of the advantages of open source is that it benefits from a large community of developers and users who can help troubleshoot problems and improve the quality of the software.

Support from senior management: this can help to reduce the internal obstacles to using OSS and sharing code.

Relevant principle(s)#

2. Work in the open

3. Cost#

The licenses are already paid for several years for the commercial products.
The cost of migration is considerable, and adequate human resources may not be available either.
OSS does not necessarily mean “free” software: professional editions also carry fees, so the cost benefit is not necessarily as high as anticipated.
Additional resources are needed to publish code and to offer support, which can be difficult when no such resources are available.

Counter argument#

Support is not strictly required: open source licenses do not include any requirement regarding the quality of OSS or any obligation to provide support to end users. Of course, minimal support is still desirable to limit the risk of critical vulnerabilities (CVEs) and to encourage collaboration.

Support contracts are cheaper than software licenses plus support: many open source software publishers (or consulting companies) offer support contracts that provide levels of support and assistance similar to those for proprietary software. These contracts give users access to expert support, bug fixes, and updates that improve the quality and reliability of the software. With OSS the cost of the “lock-in effect” is also smaller: switching to an alternative is cheaper than when relying on long-term licensing of commercial software. The cost and the changeover should be analysed from a long-term perspective, as an investment, by comparing the benefits and costs of alternative IT solutions; existing software can be ‘grandfathered’, that is, used until its licence expires, while OSS solutions are introduced stepwise, starting with new projects and new staff.

Cooperation with the private sector and research sector: this can share the burden and reduce the costs of development and maintenance.

Relevant principle(s)#

3. Improve and give back

4. Licensing#

If an NSI publishes code, it has to review the licenses of all the packages, modules, … that the code uses, to be sure that no license is violated.
NSIs are limited in their licensing “permissiveness” by the legislation in force.

Counter argument#

To mitigate the risk of license violations and have clear licensing terms:

Keep track of the libraries you are using
Review the licenses of all the libraries you are using
Consider using a tool like FOSSology to help manage open-source licenses.
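For Python code, a first inventory of the libraries in use and their declared licenses can be produced with the standard library alone, as sketched below. The function name is hypothetical, and the result depends on what each package declares in its metadata: the `License` field may be missing or may contain a full license text, so the output is a starting point for review rather than a legal assessment.

```python
from importlib import metadata


def installed_package_licenses():
    """List (package name, declared license) for every installed distribution."""
    inventory = []
    for dist in metadata.distributions():
        name = str(dist.metadata.get("Name", "unknown"))
        # Packages that declare no License field are reported explicitly,
        # so that they can be checked by hand.
        license_text = str(dist.metadata.get("License") or "not declared")
        inventory.append((name, license_text))
    return sorted(inventory)
```

Running this in the environment used to build the software gives a list that can be reviewed manually or compared with the results of a dedicated tool.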

Relevant principle(s)#

6. Choose permissive
