I recently had the opportunity to participate in the FIRST CTI Conference in Berlin, where I talked about how OSINT is not always as open as it might seem. Since the presentation included the methods and sources of specific analysts, I chose to keep it at TLP:GREEN. In this post, however, I would like to present the main theses and problems that I discussed.
The inspiration for the presentation was an article published in Foreign Affairs in which Amy Zegart argued that the United States should establish an intelligence agency focused on obtaining and analyzing information from open sources. I then began to wonder what range of skills is required to work comprehensively with open sources, and whether their openness always means that they are actually available to anyone.
If we start with OSINT at the level of state intelligence agencies, it is worth noting that the scope of information of interest will be different for such agencies than for CTI in the private sector. In the context of open sources, Samuel Wilson is often quoted as saying that 90% of the information used by the US intelligence community comes from such sources. Given the tasks assigned to agencies such as the CIA, this is not difficult to imagine. For political and economic intelligence, or for analysis of a given country's international situation, publicly available reports and media coverage can indeed provide sufficient material. In the case of CTI, however, the situation may be completely different. Consumers of CTI products, such as security operations or incident response teams, often expect information at the tactical or operational level that translates directly into methods of detecting and preventing activity. They will therefore be interested in more "unique" information that is not so easily available publicly and that complements what is already detected through signatures or public feeds.
In my opinion, OSINT, like other intelligence disciplines, has its "hard targets": sources that require specialized skills and preparation to use properly. In this respect, OSINT can be no less demanding than seemingly more "technical" fields such as malware analysis. Such sources are marked by one or more factors that complicate data acquisition and analysis:
- Linguistic – sources requiring knowledge of the language, especially if interpretation requires knowledge of specialized vocabulary, where automatic translation is not very effective.
- Cultural – interpreting the information requires understanding the cultural context, the relationship between the authors of the source and the phenomena described, or the slang used.
- Technical – the source is publicly available, but access requires specific circumstances, e.g. connection from a given country.
- Operational – the source requires technical or manual effort to obtain information, e.g. a forum that cannot be automatically scraped, so analysts must manually analyze threads.
- Legal – possession of materials obtained from the source is penalized by law. This may be the case, for example, with materials originating from criminal or terrorist groups.
A striking example of this phenomenon is the analysis of sources relating to China. Even widely available websites, forums and documents are effectively inaccessible to most analysts because of language and cultural barriers. If we want to read one of the main documents of PLA doctrine, the Science of Military Strategy, we can find it, for example, on the website of the US Air Force's Air University. However, even there, as the publisher notes, the English version is the result of automatic translation and may contain inaccuracies. The level of difficulty rises sharply for forums and social media, where automatic translation may not convey the proper context of the phrases used.
The best sources are therefore often projects dedicated to specific areas of the issue. In the case of China, we can point to China Law Translate, which publishes translations of Chinese legislation; the articles of Rick Joe (PLARealTalk), who tracks the development of China's naval and air forces; or the project hosted by the Australian Strategic Policy Institute that traces connections between Chinese universities and the national security apparatus. By using such projects, we outsource the part of the analysis related to selecting information sources and working through them to understand the context. In this way, we may not use the entire range of sources (because we limit ourselves to those selected for us), but this disadvantage is usually offset by the added value of specialized knowledge that we do not have to acquire ourselves. Naturally, the selection and verification of these expert sources remains a problem, but this is an element of every intelligence activity.
Challenges of a different nature are related to the legal aspects of information collection. For some categories of materials, mere possession is penalized, which limits the possibilities of analyzing them and exposes analysts to additional risk. An example is material originating from terrorist groups, such as the Inspire magazine published by Al-Qaeda in the Arabian Peninsula.
The magazine published both ideological content and guides on building weapons and communicating securely, so it was a potentially valuable source for understanding the goals and methods of the organization. However, in the UK the mere possession of Inspire is prohibited and there have been judgments on this matter under the Terrorism Act 2000. Even the sister of a person involved in terrorist activities was convicted, although the court accepted that she used the materials only to understand her brother's motivations. The regulations contain exemptions for journalistic and academic activities, but it is difficult to determine whether analysts working for threat intelligence teams are covered by this protection, and testing it in practice may prove to be very expensive.
How should we approach the challenges posed by sources that turn out to be not so open? The solutions fall on a spectrum defined by how many resources and man-hours we can devote to the task. The most resource-intensive solution, but also the one that provides the greatest opportunities for obtaining information, is a dedicated team responsible for gaining access to data and processing it, so that analysts can work from ready-made reports and draw their final conclusions on that basis. Such a team could include:
- Linguists and people with knowledge of source languages to faithfully translate the collected information.
- Access specialists dealing with, for example, building credible profiles on closed forums and setting up network infrastructure such as VPS instances.
- Legal and operational supervision assessing the risks associated with obtaining data or participating in discussions on criminal groups' channels.
However, such recommendations are usually not very helpful. Simply hiring additional staff may be good advice, but it is rarely practical. The most common situation will be somewhere in between, where analysts rely on tools such as automatic translation, OCR for texts written in a foreign script, and, probably increasingly, generative AI such as ChatGPT.
As an example of such a workflow, we will look at searching academic sources from China in the field of network security, trying to determine the directions and level of research development in this field. Let's start by trying to identify keywords that we could use in searches. We will use DeepL here:
In this example, we assume that we do not speak Chinese at all, so we will simply use the suggested phrase. To immediately narrow our search to PDF files, we will use the "filetype:" operator.
The first result leads to an article posted on the website of the software security research group at Peking University, so we are heading in the right direction. Integration of automatic translation with the browser helps in quick triage of results - we can even translate articles on the fly.
Once we know that we are looking in the right areas, let's try to obtain more information about research facilities that might interest us. For this purpose, we will use the previously mentioned project by the Australian Strategic Policy Institute, which catalogs information about universities linked to China's national security apparatus. The authors graphically presented the network of connections between individual ministries or branches of the armed forces and research institutions:
We can also go back to the post on counterintelligence.pl in which I described the PLA's cyber activities, including the concept of military-civil fusion and the role of the Strategic Support Force. At the intersection of the Ministry of Education and the Air Force we find Wuhan University, which is strongly associated with network operations:
On the page dedicated to the university, we will find information about its activities as well as its original name, which we will use in further searches.
Let's combine it with the earlier vulnerability phrase and filetype operator:
The first result from the infosec.org domain indicates that we are still moving in the right area. Browsing further through the results, we will find one whose URL is a bare IP address, which always arouses additional curiosity. Viewing the content through automatic translation, we will see references to the Strategic Support Force and to prizes, probably related to a competition for students.
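Spotting results served from a bare IP address can be automated when we triage a longer list of URLs. The snippet below is only a minimal sketch using the Python standard library; the function name `host_is_ip` and the sample URLs are illustrative assumptions (203.0.113.10 is a reserved documentation address, not the server discussed in this post).

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_ip(url: str) -> bool:
    """Return True when the URL's host is a literal IPv4 or IPv6 address."""
    host = urlsplit(url).hostname  # drops the port and any IPv6 brackets
    if not host:
        return False
    try:
        ipaddress.ip_address(host)
    except ValueError:
        return False
    return True


# Hypothetical search results used purely for illustration.
results = [
    "https://www.example.edu.cn/papers/report.pdf",
    "http://203.0.113.10:8080/upload/finalists.pdf",
]
flagged = [url for url in results if host_is_ip(url)]
print(flagged)
```

A check like this only flags candidates for a closer look; whether the host is actually interesting remains the analyst's call.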
At this point we can also look at the infrastructure from a technical perspective and search for an address in Shodan:
As we can see, the server is actually located in China and is part of the Huawei public cloud. To confirm this, we can look at the data about the autonomous system and who it belongs to:
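For readers who prefer to script this step, here is a rough sketch of the same lookup through the Shodan API. It assumes the third-party `shodan` Python package and a valid API key; the helper names `summarize_host` and `lookup`, and the sample record, are hypothetical and not taken from the search described above. The field names follow what Shodan host records typically contain, which is worth verifying against the current API documentation.

```python
from typing import Any, Dict


def summarize_host(record: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the fields that answer 'where is it hosted and by whom'."""
    return {
        "ip": record.get("ip_str"),
        "country": record.get("country_code"),
        "asn": record.get("asn"),
        "org": record.get("org"),
        "isp": record.get("isp"),
        "ports": sorted(record.get("ports", [])),
    }


def lookup(ip: str, api_key: str) -> Dict[str, Any]:
    """Query Shodan for one host and summarize the response."""
    import shodan  # imported lazily so the helper above works without it

    return summarize_host(shodan.Shodan(api_key).host(ip))


# Illustrative record only, not real lookup output.
sample = {
    "ip_str": "203.0.113.10",
    "country_code": "CN",
    "asn": "AS64500",
    "org": "Example Cloud Provider",
    "isp": "Example Cloud Provider",
    "ports": [443, 80],
}
print(summarize_host(sample))
```

Calling `lookup("203.0.113.10", "YOUR_API_KEY")` would perform the live query; the organization and ASN fields are the ones to compare against what the web interface shows.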
To search for further related documents, we can again use the classic Google search operators and narrow the results to a given site:
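The three searches in this walkthrough follow the same pattern, so they can be sketched as a small query builder. This is a minimal illustration, not the exact queries used here: the keyword 漏洞 ("vulnerability") is a placeholder that may differ from the phrase DeepL suggested, 武汉大学 is the Chinese name of Wuhan University, the host is a reserved documentation address, and narrowing by `site:` is my assumption about the operator used in the last step. Wrapping phrases in quotes for exact matching is likewise a design choice of this sketch.

```python
from typing import Optional


def build_query(*phrases: str, filetype: Optional[str] = None,
                site: Optional[str] = None) -> str:
    """Join exact-match phrases with optional filetype: and site: operators."""
    parts = [f'"{p}"' for p in phrases if p]
    if filetype:
        parts.append(f"filetype:{filetype}")
    if site:
        parts.append(f"site:{site}")
    return " ".join(parts)


KEYWORD = "漏洞"          # placeholder keyword, assumed for illustration
INSTITUTION = "武汉大学"   # Wuhan University
HOST = "203.0.113.10"     # documentation address standing in for the server

step1 = build_query(KEYWORD, filetype="pdf")
step2 = build_query(KEYWORD, INSTITUTION, filetype="pdf")
step3 = build_query(filetype="pdf", site=HOST)

for query in (step1, step2, step3):
    print(query)
```

Keeping the queries in code makes it easy to rerun the same search for other institutions from the ASPI catalog by swapping a single value.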
It turns out that on the server we will find lists of finalists of competitions for students in the field of information security:
Browsing through the content of the documents, we can find projects created by students of the People's Liberation Army Strategic Support Forces Polytechnic:
In this way, for example, analysts dealing with the strategic dimension of threats related to the region can assess the interests of the People's Liberation Army in cybersecurity projects. We could continue our search by looking for academic publications associated with specific research institutes. At this point, however, let us summarize what allowed us to go through this simple example of searching for Chinese sources without knowing the language:
- Machine translation tools. Of course, if we do not know the language, our greatest ally will be tools that will allow us to overcome the language barrier and understand at least some of the content. Integration with system functions and the search engine that allow you to quickly translate selected text regardless of the source enables quick filtering of search results and downloaded content.
- Understanding the ecosystem and context of the information we are looking for. While translation tools can help bridge a lack of language knowledge, it is difficult to collect and analyze data if we do not understand where and how to start looking. In the example given, we already had to know about the relations between research and government institutions in China, the concept of military-civil fusion and the role of the Strategic Support Force.
- And here we move smoothly to the third tool - external expert sources. While understanding the context is crucial at the beginning of the search, the deeper we go, the more important the support of specialized publications becomes. Ideally, they should support the interpretation of raw data such as source documents. In our example, with the help of ASPI, we were able to select scientific institutions related to cyber operations, which helped narrow the search.
- Finally, we cannot forget about the importance of universal tools like Google dorks. Quickly filtering and matching results significantly speeds up the work and allows us to focus on analysis.
OSINT can be as demanding a field as seemingly much more technically complex activities such as malware analysis or the interpretation of post-intrusion findings. As a discipline, it is also extremely broad - analysts who use satellite images and local press to assess the situation at the front may have a completely different set of knowledge and skills from those who map networks of connections between groups of people. OSINT, like HUMINT or IMINT, is only a methodology of working with sources.
So what do I think the role of OSINT will be within CTI, and how should we approach "hard targets"? The most important question is whether the team is staffed well enough to separate the functions of data collection and analysis. Specialization in gaining access, collecting and cataloging materials allows analysts to improve their tools and methods, which translates into the ability to maintain "eye contact" with the most important events, activity groups and related information. Going further, teams can build automated, scalable source scraping tools internally, allowing analysts to focus on tasks that cannot easily be automated. However, in most teams, where analysis and data acquisition are combined, the most effective approach will be a division into thematic areas of responsibility. For example, a person dealing with China-related operations from the technical side should also understand the organizational context of China's intelligence services and the strategic goals of the Chinese government. Such specialization, combined with joint work on automation tools, will allow analysts to develop the knowledge necessary to reach difficult sources and translate it into scalable solutions. It remains an open question whether we will see a separate career path for OSINT analysts within CTI teams. In my opinion, not in the near future, because the industry still suffers from a shortage rather than an excess of staff. In this context, "pure" OSINT analysis does not, in most cases, bring benefits that outweigh the cost of assigning additional people to it. What is definitely already happening is sector specialization, which enables teams to focus their efforts on specific areas. An example is the job advertisement for a senior analyst at CrowdStrike, which I saved some time ago:
And this approach will certainly result in a much better understanding of the topic and the possibility of a comprehensive analysis of events.
As we can see, not all OSINT sources are created equal. They differ in the scope of information they offer, but also in the threshold of skills and technical capabilities needed to uncover it. How much effort we invest in them will depend on the requirements placed on the team and on its goals in terms of understanding the context of cyber operations, attribution and the intentions of the perpetrators. The tools we mentioned will certainly make analysts' work easier, but ultimately the best results will come from teams whose specialization allows their members to build the knowledge needed to thoroughly understand the information ecosystem in which they work.