Analytics for Finance and Accounting:
Data Structures and Applied AI
Sean Cao
University of Maryland
Wei Jiang
Emory University
Lijun Lei
University of North Carolina at Greensboro
Contents
Preface
Forewords
Chapter 1 Data Analytics in Finance and Accounting
1.1. How to leverage data science for corporate stakeholders
The rising use of big data for decision-making
How to leverage data science for corporate stakeholders: the importance of domain knowledge
1.2. Overview of structured and unstructured data
An overview of available business data
An overview of unstructured data analytics
An overview of structured data analytics
1.3. Theory-driven and machine-learning approach of data analytics
Theory-driven and machine-learning approach of data analytics
1.4. The advantages of applying machine-learning approaches
References
Chapter 2 Analyzing Annual Reports
2.1. Data structure in annual reports and 10-K filings
Data structure of the 10-K filing
2.2. Conventional textual analysis approach
Bag of words and LDA
2.3. Empirical examples: Analyzing corporate filings for making business decisions
An empirical example of analyzing 10-K filings
Appendix 2A: Project 1a How to crawl annual reports
Appendix 2B: Project 1b How to parse unstructured data
Appendix 2C: Solution
How to crawl annual reports
How to parse unstructured data
References
Chapter 3 Emerging AI Technology in Textual Analysis
3.1. Procedures for applying machine learning models
3.2. Pre-trained phrase-level word embedding
Pre-trained phrase-level analysis
3.3. Pre-trained sentence-level word embedding
Pre-trained sentence-level analysis
3.4. Reinforcement learning
3.5. Competitive advantages of man and machine
Appendix 3: Evaluating machine learning models
References
Chapter 4 Analyzing Earnings Conference Calls
4.1. Data structure in earnings conference calls
Data structure of conference calls
4.2. Standard dependence parser
4.3. Empirical example: Measuring corporate culture using machine learning
4.4. Empirical example: From words to syntax: Identifying context-specific information in textual analysis
Appendix 4: Applying GPT to analyze conference call transcripts using both API and web interface
Applying GPT to analyze conference call transcripts
Chapter 5 Analyzing Material Company News
5.1. Data structure in 8-K filings
Data structure of the 8-K filing
5.2. Empirical example: Technological peer pressure and product disclosure
5.3. Empirical example: A game of disclosing “other events”
References
Chapter 6 Analyzing Data from Social Media
6.1. What is social media?
6.2. Data from Social Media
Data from social media platforms
An example of analyzing disclosure on social media platforms
6.3. Empirical Example: Negative Peer Disclosure
References
Chapter 7 Data Analytics in Environmental, Social, and Governance
7.1. Corporate Governance
7.2. Textual data for corporate governance
Data in proxy statements
Data in corporate social responsibility disclosure
7.3. Emerging technologies as governance mechanisms
7.4. Empirical example: Auditing and blockchain
References
Chapter 8 Analyzing Unstructured Data from Fund Managers
8.1. Mutual fund disclosure in Form N-CSR
8.2. Empirical example: Extracting fund managers’ private information and risk assessment from mutual fund shareholder reports
References
Chapter 9 Analyzing Image Data
9.1. Images in corporate executive presentations
9.2. Empirical example: Visual information in the age of AI
References
Chapter 10 Analyzing the Balance Sheet
10.1. Data structure in balance sheets
Data structure of the balance sheet
Debate about fair value accounting
10.2. Empirical example: Analyzing data in the balance sheet
Analyzing the balance sheet
An empirical example of analyzing the balance sheet
10.3. Machine learning applications for balance sheet data
Appendix 10. Regression Methods
An overview of regressions
Three examples of regressions
Fama-MacBeth regression
Two-way sorting
Risk-adjusted return sorting
References
Chapter 11 Analyzing the Income Statement
11.1. Data structure in income statements
Data structure of the income statement
11.2. Earnings and stock prices
11.3. Empirical example: Post-earnings announcement drift (PEAD) anomaly
An empirical example of analyzing the income statement
11.4. Machine learning application on income statement data
Appendix 11. Key variable explanations
References
Cao, Jiang, & Lei
______________________________________________________________________________
Preface
Accounting and finance students with a keen interest in applying artificial intelligence (AI)-based
tools in research often encounter a challenge: the need to enroll in separate programming
courses that operate independently from their core curriculum. This creates an educational void,
leaving students to integrate the two disciplines on their own. This book addresses
this gap by equipping students with the necessary skills to integrate applied AI with
domain-specific data and knowledge in the fields of accounting and finance.
Unlike conventional approaches that commence with programming training, this book begins by
acquainting students with domain data, textual features, and use cases associated with these data.
Relevant emerging technologies are then introduced along with use cases. Upon completing the
book, students should be able to apply AI and machine learning tools to generate and analyze
unstructured financial data, such as conference call transcripts, press releases, annual reports, ESG
reports, social media disclosures, product/operational images, and fund managers’ disclosures. To
enhance the learning experience, the book is complemented by a video library, featuring
educational videos corresponding to each chapter, including tutorials on how to use the GPT API.
The book offers two flexible learning approaches. Firstly, it can serve as foundational material
for the equivalent of a business school graduate-level course, emphasizing the data-driven nature
of these disciplines. Secondly, for instructors teaching traditional courses such as financial
accounting or corporate finance, relevant chapters (e.g., the chapter on annual reports) can function
as supplementary material. Prerequisite coding requirements are minimized for both instructors
and students throughout the book. Even in instances where coding is necessary, tutorial videos and
exercises facilitate a nuanced understanding of coding workflows. Upon finishing this book,
students should have a clear picture of diverse use cases of AI tools in the fields of accounting and
finance. The materials in this book could also serve as preparation for students to pursue more
technical courses.
Forewords
Gareth M. James
We have all seen the power of AI to impact every facet of our lives from picking TV shows on
streaming feeds, to self-driving taxis, to generative AI helping with your term paper. But how can an
individual without extensive coding skills take advantage of this technology? “Integrating Artificial
Intelligence in Accounting and Finance” addresses this challenge for students seeking proficiency in AI
applications to finance areas. This book immerses learners in domain-specific data and knowledge,
gradually introducing AI and machine learning tools. Minimizing coding prerequisites, it ensures a
seamless journey into the intersection of AI and finance, preparing students for diverse technical
challenges. By delving into unstructured financial data, from conference call transcripts to social media
disclosures, students develop practical skills. Enhanced by a comprehensive video library and
accommodating various learning approaches, this book serves as both a foundational guide for modern
data-oriented graduate-level courses and supplementary material for more traditional classes.
Kai Li
One of the first endeavors to devote an entire book to applied AI for the finance and accounting
audience, by researchers who are early adopters of those techniques. A must-read for scholars and doctoral
students seeking an introduction to the increasingly important research tools of big data and machine learning.
Andrew Karolyi
The world of finance and accounting is experiencing yet another revolution. This one is about big
data analytics. And it is transformative.
Capital markets data is now everywhere. Whether in the form of data feeds from trading
platforms and exchanges, mandatory filings by corporates to regulators, or social media or news feed
coverage of those companies and markets. It is structured and unstructured, textual and audiovisual. And
there are no barriers to managing these data for competitive advantage because AI and big data
analytics are advancing as quickly as the various forms of data are being generated.
Portfolio managers, analysts, auditors, corporate financial officers and capital market regulators
know they need upskilling in tools and techniques, and fast. This dizzying number of new approaches to
big data analytics needs organizing principles to create logic and order. The new book by Cao, Jiang, and
Lei could not have arrived soon enough. It is well-written, accessible to general readers, and the chapters
are effectively supplemented with useful videos to showcase methods and techniques. The core of the book
focuses on distilling textual data, but my favorite is the back third of the book that showcases how
traditional data on balance sheets, income statements and stock returns can be integrated using machine
learning approaches.
Acknowledgements
The authors appreciate Jackie Cardello, Will Cong, Ilia Dichev, Jennifer Disharoon, Lawrence
Gordon, Gareth M. James, Andrew Karolyi, Kai Li, Mark Liu, Vojislav Maksimovic, Lei Zhou,
and seminar participants at Renmin University for helpful feedback and comments on the textbook.
We also appreciate the support from FinTech at Cornell Initiative and Digital Economy and
Financial Technology (DEFT) Lab. Sean Cao appreciates support from the Smith AI initiative and
from GRF CPAs & Advisors.
Chapter 1 Data Analytics in Finance and Accounting
1.1. How to leverage data science for corporate stakeholders
The rising use of big data for decision-making
Data analytics is a comprehensive field covering diverse activities that involve collecting,
organizing, and analyzing raw data. Propelled by advancements in computing power, mass storage,
and machine learning, the utilization of data analytics has surged over the past decade. It possesses
the capacity to analyze various types of information, including both structured and unstructured
data.
Capital markets produce a plethora of information that is crucial for efficient contracting,
risk-sharing, and resource allocation. The nature of a company is shaped by the structure of its
contractual arrangements with stakeholders. These contractual relations are essential to
companies, forming a network of interconnected contracts involving stakeholders such as
customers, suppliers, employees, investors, communities, and others who have a stake in the
company. In the traditional system, information provision and consumption centered on corporate
disclosure. Managers of a company decide the amount of information to supply by weighing the
costs and benefits. Although managers can exert control over what information a company supplies
and when, regulatory agencies consistently intervene in this process by establishing a baseline
level of information that must be released with various disclosure requirements.
In today’s capital markets, information can be generated by and extracted from a multitude of
sources, including and beyond mandatory filings required by regulators, companies’ voluntary
disclosures, information produced by financial analysts, shared by competitors, uncovered by news
media, etc. Accordingly, the decision-making process has evolved from a model in which
managers primarily rely on their experience to a data analytics-driven approach. However, the shift
presents an inherent challenge: the necessity for large-scale data collection from different sources
and in various formats. Data analytics techniques assist stakeholders in collecting relevant
information, organizing both structured and unstructured data, and then conducting appropriate
analyses to reveal trends and metrics that would otherwise be obscured by the abundance of
information. Given the central role that information plays in capital markets, corporate
stakeholders increasingly recognize the value and importance of data analytics. As revealed in
Cao, Jiang, Yang, and Zhang (2023), there is a discernible uptick in the application of data analytic
tools in analyzing regulatory company filings downloadable from the Securities and Exchange
Commission’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database system.
Specifically, the proportion of automatic machine downloads of annual and quarterly regulatory
company filings (i.e., 10-Ks and 10-Qs) surged from under 40 percent in 2003 to over 80 percent
after 2015 (Figure 1).
What separates us from computer science and statistics majors: The importance of domain knowledge
How to leverage data science for corporate stakeholders: the importance of domain knowledge
If computers begin to play an increasingly important role in data analytics, an interesting
question arises: can computers outperform humans? High-profile human-computer competitions
began in chess. One of the most famous chess computers is Deep Blue because of the chess match
between Deep Blue and World Chess Champion Garry Kasparov in 1997.
Figure 1. Machine downloads of 10-K and 10-Q filings
This figure plots the annual number of machine downloads (blue bars and left axis) and the annual percentage of machine downloads over
total downloads (red line and right axis) across all 10-K and 10-Q filings from 2003 to 2016. Machine downloads are defined as downloads from
an IP address downloading more than 50 unique firms’ filings daily. The number of machine downloads and the number of total downloads for
each filing are recorded as the respective downloads within seven days after the filing becomes available on EDGAR.
Source: Cao et al. (2023)
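The machine-download definition in the note to Figure 1 can be illustrated with a short sketch. The code below is only a minimal illustration, not the procedure used by Cao et al. (2023): the log layout of (IP address, day, firm identifier) tuples, the function names, and the toy data are all assumptions made for this example, and the seven-day window after filing is omitted for brevity.

```python
from collections import defaultdict


def flag_machine_ip_days(records, threshold=50):
    """Return the (ip, day) pairs that count as machine downloaders.

    records: iterable of (ip, day, firm_id) tuples (hypothetical log layout).
    A pair is flagged when that IP downloads filings of more than
    `threshold` unique firms on that day, following the Figure 1 note.
    """
    firms_by_ip_day = defaultdict(set)
    for ip, day, firm_id in records:
        firms_by_ip_day[(ip, day)].add(firm_id)
    return {key for key, firms in firms_by_ip_day.items() if len(firms) > threshold}


def machine_download_share(records, threshold=50):
    """Fraction of all downloads that come from flagged (ip, day) pairs."""
    records = list(records)
    if not records:
        return 0.0
    flagged = flag_machine_ip_days(records, threshold)
    machine = sum(1 for ip, day, _ in records if (ip, day) in flagged)
    return machine / len(records)


# Toy log (made-up data): one IP touches 60 firms in a day, another touches 3.
log = [("10.0.0.1", "2015-03-02", f"firm{i}") for i in range(60)]
log += [("10.0.0.2", "2015-03-02", f"firm{i}") for i in range(3)]
print(flag_machine_ip_days(log))
print(machine_download_share(log))
```

In this toy log only the first IP address exceeds the 50-firm threshold, so all of its downloads, and none of the second IP's, are counted as machine downloads.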
Cao, Jiang, Wang, and Yang (2021) build an AI analyst capable of processing corporate
financial information, qualitative disclosure, and macroeconomic indicators. They find that an AI
analyst built with the currently available technology could indeed beat a majority of human
analysts in stock price forecasts (Figure 2). The relative advantage of the AI analyst is more
pronounced when the firm is complex, and when information is high-dimensional, transparent and
voluminous. Nevertheless, human analysts retain their competitive advantage when critical
information requires institutional knowledge. More importantly, the edge of AI over human
analysts diminishes over time as human analysts gain access to alternative data and in-house AI
resources. Unsurprisingly, combining AI’s computational power and the human art of
understanding soft information proves to have the highest potential for generating accurate
forecasts.
Data analytics techniques can be complicated and rapidly evolving. Yet, the first step in
utilizing them is relatively simple: identifying the required information and devising a strategy to
gather it. For instance, analysts must understand the objectives of the decision-making process, the
pertinent information necessary to facilitate decision-making, and potential sources of useful
information. This creates a pressing demand for business professionals who possess both domain
knowledge in business and a practical understanding of data analytics techniques. Business
professionals with both business expertise and data analytics skills can play a critical bridging role:
deciphering the information needs of decision-makers, conducting preliminary analyses, and leading a team
to formalize and implement quantitative models.
Figure 2. The performance of AI-assisted analyst vs human analysts
This figure plots the proportion of AI-assisted Analyst recommendations that are more accurate than the Analyst recommendations alone on
an annual basis. The blue line in the middle gives the annual AI-assisted Analyst beat ratios, the blue-dotted lines above and below are the 95%
confidence interval of the beat ratio, and the red line gives the best linear approximation of the trend in beat ratios.
Source: Cao et al. (2022)
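To make the beat ratio in Figure 2 concrete, the sketch below computes the share of cases in which one set of forecasts is more accurate than another, together with a 95% confidence interval. It is only an illustration under stated assumptions: the inputs are hypothetical forecast errors, accuracy is measured by absolute error, and the interval uses the normal approximation for a proportion, which may differ from the method used in the source study.

```python
import math


def beat_ratio_with_ci(ai_errors, analyst_errors, z=1.96):
    """Share of cases where the AI-assisted error is strictly smaller.

    Returns (ratio, lower, upper), where the bounds come from the normal
    approximation to a binomial proportion (an assumption of this sketch).
    """
    if len(ai_errors) != len(analyst_errors) or not ai_errors:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(ai_errors)
    wins = sum(1 for a, h in zip(ai_errors, analyst_errors) if abs(a) < abs(h))
    ratio = wins / n
    half_width = z * math.sqrt(ratio * (1 - ratio) / n)
    # Clip the interval so that it stays inside [0, 1].
    return ratio, max(0.0, ratio - half_width), min(1.0, ratio + half_width)


# Made-up forecast errors for four firms in one year.
ai = [0.10, -0.05, 0.20, 0.40]
human = [0.30, 0.15, -0.25, 0.10]
ratio, lower, upper = beat_ratio_with_ci(ai, human)
print(ratio, lower, upper)
```

With only a handful of observations the interval is very wide, which is why beat ratios are typically computed over many forecasts per year.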
Furthermore, the advancement of artificial intelligence (AI) opens the avenue to various
domain-specific applications in areas such as investing, compliance, and marketing. AI
applications help not only in improving productivity but also in managing associated legal and
security risks. On the other hand, some abilities, such as reasoning-based intelligence, remain
unique to humans, though they could be enhanced by AI. In this new era, the development of AI will
only increase the demand for professionals with domain expertise.
Figure 3 illustrates a typical data analytics team composed of a customer team and an
implementation team. The customer team includes the client-facing product manager, who must
understand customer demand, perform preliminary analysis, and communicate both the demand
and desirable solutions to the implementation team. An ideal product manager should possess
business knowledge and be proficient in applying basic data analytics tools. The implementation
team consists of a lead team serving as the bridge between the customer team and the
implementation team, and a data team specializing in implementation. The lead team requires
professionals with strong business knowledge and data analytic skills since they are responsible
for receiving demand from the product manager, conveying data analytics solutions to data
scientists, and performing quality control. The data team mostly comprises data scientists, with
backgrounds in computer science or statistics and strong programming skills. Business
professionals with data analytics skills can excel in roles that require integrated skills, such as the
product manager and the tech lead.
The objective of this book is to introduce computational tools and AI technologies to business
major students and link these tools to business domain knowledge before they jump into learning
programming. Understanding the domain-specific data features, questions, and use cases will be
the key to retaining competitive advantage in an era of data-driven decision-making.
Figure 3. The importance of data analytics and business domain knowledge
Tailoring data science to the needs of different corporate stakeholders
Data analytics for corporate talent
Managers and employees are heavily invested in their company’s current and future financial
well-being. This creates a robust demand for information on the company’s operating and financial
condition, including profitability, and future prospects, as well as comparative information on
competing peers and business opportunities. Such information is also required to design
compensation and incentive contracts. Information extracted using data analytics tools could assist
managers in addressing all these questions, including, for example:
• What product lines, geographic areas, or other segments are performing well in comparison to our peer companies and our own benchmarks?
• Should we consider expanding or contracting our business?
• How will current profit levels impact incentive- and share-based compensation?
• What capital structure is suitable for our business?
• How to improve cash flow management?
• What is an appropriate dividend payout policy?
• How are we doing compared to competitors?
Data analytics for shareholders
Suppliers of capital to companies, including shareholders and creditors, rely on data
analytics to gain insights into a company’s financial health, performance, and risks in order to
determine an appropriate cost of capital for the firm. Expectations of future profitability and cash
generation impact a company’s stock price and its ability to borrow money on favorable terms.
Investors also use company information to evaluate managerial performance. Here are some
examples of questions that information extracted using data analytics could assist investors in
addressing:
• What are the expected future profits, cash flows, and dividends for input into stock-price models?
• Is the company financially solvent and able to meet its financial obligations?
• How do expectations about the economy, interest rates, and the competitive environment affect the company?
• Is company management demonstrating good stewardship of the resources to which they have been entrusted?
• Do we have the information we need to critically evaluate strategic initiatives proposed by management?
Data analytics for supply chain partners
Suppliers and customers need company information to make important decisions regarding
their financial transactions and business relationships. For example, suppliers can use data
analytics to establish credit terms and evaluate their long-term commitment to supply-chain
relations. They also use company information to monitor and adjust their contracts and
commitments. Customers seek company information to assess a company’s ability to provide
products or services, as well as its staying power and reliability. Here are some examples of
questions that creditors and suppliers can address with the help of data analytics:
• Should we extend credit in the form of a loan or line of credit for inventory purchases?
• Should we procure raw materials from this supplier?
Data analytics for other stakeholders
Auditors rely on company information to detect potential financial misstatements.
Governments demand company information to ensure compliance with laws and regulations. As
illustrated in Figure 4, all stakeholders can uncover critical insights for their decision-making
through the application of data analytics.
Figure 4. Firm as a nexus of contracts
1.2. Overview of structured and unstructured data
An overview of available business data
Unstructured data analytics
An overview of unstructured data analytics
Advances in data analytics techniques have led to new sources of information that enable us to
tackle a broader range of problems. Recently, there have also been innovations in the methods
used to create new data. Information collected from these fresh sources or generated through these
new mechanisms is largely unstructured qualitative data.
More and more information users are drawn to texts in firm regulatory filings. Despite
containing financial statements and pages of tables and charts, annual reports and other company
regulatory filings still consist mostly of text. The recent reduction in computer storage costs and
an increase in computer processing capabilities have made textual analysis of these disclosures
more feasible. Regulatory filings that are widely available for analysis include annual reports,
current reports, proxy statements, initial public offering (IPO) prospectuses, and more.
Conference call transcripts enable analysts to capture and analyze information disclosed
during corporate conference calls. These calls provide an opportunity for managers to announce
and discuss the firm’s financial performance, while allowing analysts and investors to pose
relevant questions about the company.
Environmental, social, and governance (ESG) reports serve as internal and external
communications detailing a company’s ESG initiatives and their impact on the environment and
society. While some countries mandate the annual publication of ESG reports, many companies in
regions without such requirements also voluntarily release them.
Social media has become a vital communication channel for businesses. Social media
platforms enable the rapid dissemination of information to millions of people within seconds. This
evolution in information exchange has opened up a new range of opportunities for companies to
inform and interact with stakeholders. Company information shared on social media platforms,
whether by the company itself or by investors, consumers, competitors, and others, offers an
additional perspective on a company’s operations, performance, and risks.
Audio data pertaining to business activities can also enhance decision-making processes. For
example, in addition to analyzing textual transcripts of conference calls and investment
presentations, audio recordings of these events can provide valuable nuances to analysts.
Video and image data are more widely used than ever before due to the progress of video and
image capture devices. Development in computer algorithms enables the processing and
interpretation of static images and the derivation of objective information from videos.
Product-related images provided by companies or shared by customers are examples of potentially valuable
image data. Videos of investor presentations, product releases, and other company events could be
valuable for managers, investors, and other decision-makers.
Structured data analytics
An overview of structured data analytics
Traditionally, company information is financial in nature and comprises structured
quantitative data aggregated and used to prepare financial statements for internal and external
information users. Information intermediaries and other marketplace agents also produce company
information for capital market participants.
Financial statements provide critical financial information in accordance with applicable
accounting standards to ensure the relevancy, reliability, and comparability of firm information.
Companies use four financial statements to periodically report on their business activities: the
balance sheet, income statement, statement of stockholders’ equity, and statement of cash flows.
The balance sheet reports on a company’s financial position at a specific point in time, while the
income statement, statement of stockholders’ equity, and statement of cash flows report on
performance over a period of time. These three statements link the balance sheet from the
beginning to the end of a period.
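The way the period statements link the beginning and ending balance sheets can be shown with a small roll-forward. The sketch below is a simplified illustration with made-up numbers: the function name is hypothetical, and other equity changes such as share repurchases or other comprehensive income are left out.

```python
def roll_forward(beginning_balance, period_changes):
    """Ending balance = beginning balance + sum of changes during the period."""
    return beginning_balance + sum(period_changes)


# Statement of cash flows: links beginning cash to ending cash.
beginning_cash = 120.0
net_cash_flows = [45.0, -30.0, -5.0]  # operating, investing, financing (made up)
ending_cash = roll_forward(beginning_cash, net_cash_flows)

# Income statement and statement of stockholders' equity:
# link beginning equity to ending equity.
beginning_equity = 400.0
equity_changes = [60.0, -20.0, 10.0]  # net income, dividends, stock issued (made up)
ending_equity = roll_forward(beginning_equity, equity_changes)

print(ending_cash, ending_equity)
```

The ending cash and ending equity computed this way are the figures that appear on the balance sheet at the end of the period.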
Executive compensation disclosures provide information concerning the amount and type of
compensation paid to a firm’s chief executive officer, chief financial officer, and the other three
highest-paid executive officers. The company must also disclose the criteria used for executive
compensation decisions and the relationship between the company’s executive compensation
practices and corporate performance. The Summary Compensation Table, included in the proxy
statement, is the cornerstone of the required disclosures. The Summary Compensation Table
provides a comprehensive overview of a company’s executive compensation practices in a single
chart. It is followed by other tables and disclosures with more specific information on the
components of compensation for the last completed fiscal year, for example, information about
grants of stock options, stock appreciation rights, long-term incentive plan awards, pension plans,
employment contracts and related arrangements.
Financial analyst forecasts and recommendations provide useful processed information
from financial experts. Financial analysts provide short-term and long-term forecasts on various
financial metrics, including earnings, sales, capital expenditures, etc. Financial analysts also
regularly issue investment recommendations.
Loan agreements include details such as loan terms, amounts, interest rates, and required
collateral. Loan agreements often include contractual requirements, called covenants, that restrict
a company’s behavior in some fashion. Violation of loan covenants can lead to early repayment or
other compensation demanded by the lender. Information in loan agreements thus reflects the
creditor’s assessment of a company’s credit risk.
Patents and citations are valuable for understanding the innovations a company has in
development. Patent filings protect the most significant discoveries, providing a wealth of
technological, geographical, and industry data. The relationship between an invention’s economic
importance and patent data is well-documented. Unsurprisingly, patent data has become
increasingly used in business analytics.
In this book, unstructured textual and image data are discussed in chapter 2 to chapter 9;
structured numerical data are introduced in chapter 10 and chapter 11.
Figure 5. Structured and unstructured business data
Figure 6. Data analytics based on structured and unstructured business data
1.3. Theory-driven and machine-learning approach of data analytics
Theory-driven and machine-learning approach of data analytics
Theory-driven approach vs. machine-learning approach
When analyzing either quantitative or qualitative data, two approaches can be taken in data
analytics. The theory (hypothesis)-driven approach relies on theory-based hypotheses to guide
the direction of data analytics, whereas the machine-learning approach starts with supplying data
to a computer model to train itself to identify patterns or make predictions. The theory-driven
approach resembles the human thinking process, making it intuitive and interpretable. In contrast,
the machine-learning approach leverages the computational power of machines to yield strong
predictive capability, but the machine learning process remains a “black box.”
As an example, the Securities and Exchange Commission (SEC) charged Luckin Coffee Inc.
with materially misstating its financial statements to falsely appear to achieve rapid growth and
increased profitability and to meet the company’s earnings estimates. The fraud was uncovered by
Muddy Waters LLC, an investment research firm specializing in detecting financial fraud. The
firm received an anonymous tip and mobilized 92 full-time and 1,418 part-time staff on the ground
to run surveillance, recording store traffic for 981 store-days covering 100% of the operating hours
of 620 stores. The investigation resulted in more than 11,200 hours of videotaping and led to the
conclusion that the number of items per store was inflated by at least 69 percent in the third quarter
of 2019 and 88 percent in the fourth quarter. This is a typical investigation following the theory
(hypothesis)-driven approach, where Muddy Waters LLC first formed a hypothesis that Luckin
Coffee Inc. misstated financial statements and then conducted an investigation to examine the
hypothesis. In contrast, financial statement auditors are required to perform analytical procedures
that aim at detecting potential anomalies in financial reporting without necessarily having a
hypothesis that a company misstates financial statements. This is an example of the machine-learning-driven approach.
While the availability of a sea of new data makes the machine-learning approach an attractive
opportunity, it is important to remember that theory is the guiding principle that helps us think
through the patterns uncovered by machine-learning techniques. There is ample room for both
the theory (hypothesis)-driven approach and the machine-learning approach in the field of accounting and
finance (Goldstein et al. 2019).
Both approaches can be employed to perform three types of data analytics: descriptive,
inference, and predictive. Descriptive analytics summarize data and describe observable patterns,
focusing on understanding what has happened over a period of time. Descriptive analytics
techniques include descriptive statistics and cross-tabulation of data. However, conventional
descriptive analytics rely mostly on structured numeric data. To offer a more complete picture, AI
provides the necessary technologies to retrieve novel data from various structured and unstructured
sources.
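A minimal sketch of the two descriptive techniques named above, descriptive statistics and cross-tabulation, using only the Python standard library. The firm-year records and field names are hypothetical.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical firm-year records, for illustration only.
records = [
    {"industry": "Retail", "profitable": True, "roa": 0.08},
    {"industry": "Retail", "profitable": False, "roa": -0.02},
    {"industry": "Tech", "profitable": True, "roa": 0.15},
    {"industry": "Tech", "profitable": True, "roa": 0.11},
    {"industry": "Tech", "profitable": False, "roa": -0.05},
]

# Descriptive statistics for return on assets (ROA).
roa = [r["roa"] for r in records]
summary = {"n": len(roa), "mean": mean(roa), "median": median(roa), "stdev": stdev(roa)}

# Cross-tabulation: number of firm-years in each (industry, profitable) cell.
crosstab = Counter((r["industry"], r["profitable"]) for r in records)

print(summary)
for cell, count in sorted(crosstab.items()):
    print(cell, count)
```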
Inference analytics seek to understand what happened and why, involving more diverse data
inputs and a deeper dive into the data. Inference analytic techniques include correlations,
regressions, and other statistical methods. For instance, inference analytics can be applied to
investigate the economic implications of AI applications in accounting and finance, such as virtual
currency, digital payment, peer-to-peer loans, crowdfunding, blockchain, and robo-investing
(Goldstein, Jiang, and Karolyi 2019).
Finally, predictive analytics explores what is likely to happen or what will happen “if”
something else happens. For example, when faced with new technologies, a critical decision is
whether to embrace the new technology, stick with the existing technology, or invest in both.
Would tablets ultimately replace laptops? Would all-electric cars dominate hybrid cars eventually?
Chandrasekaran, Tellis, and James (2022) draw on the theory of disruptive change to develop a
model of the diffusion of successive technologies that helps managers estimate and predict
technological leapfrogging, cannibalization, and coexistence. To predict what will happen, we first
need to understand what happened, how, and why; hence, predictive analytics builds on descriptive
and inference analytics. Conventional predictive analytic techniques involve building models
using past data and statistical techniques, including regression and a deep understanding of cause
and effect. Machine-learning algorithms allow patterns to be learned from a training dataset and
predictive models to be built with limited human intervention.
Figure 7. Theory-driven and machine-learning approach of data analytics
1.4. The advantages of applying machine-learning approaches
Machine learning has several unique advantages compared with conventional data analysis
techniques. First, traditional statistical methods often struggle with large amounts of data. In
contrast, machine learning algorithms, such as convolutional neural networks (CNNs), can select
the best features to process information effectively. Another advantage of machine learning is its
ability to handle nonlinear relationships in data. Along with the information explosion, the types
of information available to analysts have grown from mostly numerical data to more complex data
involving both text and images. Machine learning can identify nonlinear patterns and make
predictions using various types of data, making it particularly valuable for tasks such as natural
language processing and computer vision. Machine learning also offers the ability to make out-of-sample predictions, which is useful in cases where data is limited. In traditional statistical methods,
repeated optimization (learning) is often required to improve the accuracy of the model. However,
in machine learning, this process is streamlined and only requires one-time optimization. For tasks
involving time series data, machine learning algorithms, such as long short-term memory (LSTM)
networks, can be applied to identify time-series patterns. Finally, machine learning algorithms are
designed to be efficient, which is important given the increasing amounts of data being generated.
By using optimized algorithms, machine learning can process data quickly and effectively, making
it a valuable tool in fields such as healthcare and finance, where time is of the essence.
Machine-learning techniques include supervised and unsupervised learning, self-supervised
learning, transfer learning, ensemble learning, and more. Supervised learning involves training a
model with labeled data. In contrast, unsupervised learning does not require labeled data and
instead focuses on finding patterns and relationships within the data itself. Self-supervised learning
is a variant of unsupervised learning that uses the data itself to generate labels, such as using stock
returns to label news positivity. Transfer learning is a powerful technique that uses a pre-trained
model to tackle new tasks with less training data. This approach is useful in cases where the cost
of obtaining labeled data is high or where the amount of labeled data available is limited. These
machine-learning techniques are discussed in detail in later chapters in the book.
Although machine-learning techniques are powerful tools for analyzing data, we should still
keep the Occam’s razor principle in mind: we should try the simpler models as well as more
complex machine-learning techniques, and make an informed tradeoff between performance and
complexity (James, Witten, Hastie, and Tibshirani 2023).
References
Cao, S., Jiang, W., Wang, J., and Yang, B. 2024. From man vs. machine to man + machine: The art
and AI of stock analysis. Journal of Financial Economics, forthcoming.
Cao, S., Jiang, W., Yang, B., and Zhang, A. 2023. How to talk when a machine is listening?
Corporate disclosure in the age of AI. Review of Financial Studies, 36(9), 3603-3642.
Chandrasekaran, D., Tellis, G., and James, G. 2022. Leapfrogging, cannibalization, and survival
during disruptive technological change: The critical role of rate of disengagement. Journal
of Marketing, 86(1), 149-166.
Goldstein, I., Jiang, W., and Karolyi, G.A. 2019. To Fintech and beyond. Review of Financial
Studies, 32(5), 1647-1661.
James, G., Witten, D., Hastie, T., and Tibshirani, R. 2023. An introduction to statistical learning.
Springer.
Chapter 2 Analyzing Annual Reports
2.1. Data structure in annual reports and 10-K filings
Data structure of the 10-K filing
Each year, U.S. public companies are required to produce a Form 10-K and file it with the U.S.
Securities and Exchange Commission (SEC) within 60 to 90 days of the end of the fiscal year, depending on the company’s filer status. The
SEC’s Electronic Data Gathering, Analysis, and Retrieval (EDGAR) database system allows
anyone to retrieve a company’s 10-K report. Some companies also post their 10-K reports on their
websites. In addition, SEC rules mandate that companies send an annual report to their
shareholders in advance of annual meetings. While both annual reports and 10-K filings provide
an overview of the company’s performance for the given fiscal year, annual reports tend to be much
more visually appealing than 10-K filings. Companies put effort into designing their annual
reports, using graphics and images to communicate data, while 10-K filings only report numbers
and other qualitative information, devoid of design elements or additional flair.
2.1.1. Data structure in 10-K filings
A comprehensive Form 10-K contains four parts and 15 items. Researchers are often most
interested in Item 1, “Business,” Item 1A, “Risk Factors,” and Item 7, “Management’s Discussion
and Analysis of Financial Condition and Results of Operations.” Therefore, we begin our
discussion with these three items of Form 10-K.
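Because each item starts with a heading such as "Item 1A.", a first pass at isolating individual items can be sketched with a regular expression. The function name `split_items` and the sample text are hypothetical, and the pattern assumes plain text in which every item heading starts its own line; real filings (HTML markup, tables of contents that repeat the headings) need more careful handling.

```python
import re

# Matches a heading such as "Item 1.", "ITEM 1A.", or "Item 7:" at the start of a line.
ITEM_HEADING = re.compile(
    r"^\s*item\s+(\d{1,2}[a-c]?)\s*[.:]", re.IGNORECASE | re.MULTILINE
)


def split_items(filing_text):
    """Return a dict mapping item labels (e.g. '1A') to the text under that heading."""
    matches = list(ITEM_HEADING.finditer(filing_text))
    sections = {}
    for current, following in zip(matches, matches[1:] + [None]):
        # An item runs until the next item heading, or to the end of the text.
        end = following.start() if following else len(filing_text)
        sections[current.group(1).upper()] = filing_text[current.end():end].strip()
    return sections


# Hypothetical, heavily shortened filing text.
sample = """Item 1. Business
The Company designs and sells devices.
Item 1A. Risk Factors
Supply chain disruptions could harm results.
Item 7. Management's Discussion and Analysis
Net sales increased during the year.
"""

sections = split_items(sample)
print(sorted(sections))
```

If a heading appears more than once, this sketch keeps only the last occurrence, which is one reason it should be treated as a starting point rather than a robust parser.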
Item 1, “Business,” appears in Part I. It gives a detailed description of the company’s business,
including its main products and services, its subsidiaries, and in which markets it operates. To gain
an understanding of a company’s operations and its primary products and services, Item 1 serves
as an excellent starting point. Figure 1 shows an excerpt of Item 1 of Apple Inc.’s 2021 10-K. It
introduces the company’s background and provides information on Apple’s main products.
Figure 1. Excerpt from Item 1 of Apple Inc.’s 2021 Form 10-K
Item 1A, “Risk Factors,” is also found in Part I of Form 10-K. It outlines the most significant
risks faced by the company or its securities. In practice, this section focuses on the risks
themselves, not on how the company addresses those risks. The outlined risks may pertain to the
entire economy or market, the company’s industry sector or geographic region, or be unique to the
company itself. Figure 2 shows an excerpt of Item 1A of Apple Inc.’s 2021 10-K, which discusses
business risks arising from the COVID-19 pandemic, such as disruptions in the supply chain and
logistical services and store closures.
Item 7, “Management’s Discussion and Analysis of Financial Condition and Results of
Operations,” presents the company’s perspective on its financial performance during the prior
fiscal year. This section, commonly referred to as the MD&A, allows company management to
summarize its recent business in its own words. The MD&A presents:
• The company’s operations and financial results, including information about the
company’s liquidity and capital resources and any known trends or uncertainties that could
materially affect the company’s results. This section may also present the management’s
views on key business risks and how they are being addressed.
• Material changes in the company’s results compared to the prior period, as well as off-balance-sheet arrangements and contractual obligations.
• Critical accounting judgments, such as estimates and assumptions.
Figure 3 is an excerpt from Item 7 of Target’s 2021 10-K. It begins with highlights of the fiscal
year, provides a summary of financial outcomes, and then continues to analyze key performance
indicators, such as the gross margin.
Figure 2. Excerpt from Item 1A of Apple Inc.’s 2021 Form 10-K
Figure 3. Excerpt from Item 7 of Target’s 2021 Form 10-K
Other Items in Form 10-K
Part I of Form 10-K
Part I of the report includes four items in addition to Item 1 and Item 1A. Item 1B,
“Unresolved Staff Comments,” requires the company to explain certain comments received from
SEC staff on previously filed reports that have not been resolved after an extended period of time.
Item 2, “Properties,” describes the company’s significant physical properties, such as principal
plants, mines, and other materially important physical properties. Figure 4 displays Item 2 from
Apple Inc.’s 2021 10-K. It reveals that Apple Inc. owns and leases facilities and land both within
and outside the U.S.
Figure 4. Excerpt from Item 2 of Apple Inc.’s 2021 Form 10-K
Item 3, “Legal Proceedings,” requires companies to disclose information about significant
pending lawsuits or other legal proceedings other than ordinary litigation. Figure 5 displays Item
3 from Apple Inc.’s 2021 10-K. It is worth noting that it is not uncommon for companies to be
involved in legal proceedings. Item 4 has no required information and is reserved by the SEC for
future rulemaking.
Figure 5. Excerpt from Item 3 of Apple Inc.’s 2021 Form 10-K
Part II of Form 10-K
Part II of Form 10-K comprises seven items in addition to Item 7. Item 5, “Market for
Registrant’s Common Equity, Related Stockholder Matters and Issuer Purchases of Equity
Securities,” provides information about the company’s equity securities, including market
information, the number of shareholders, dividends, stock repurchases by the company, and other
relevant information. Figure 6 provides an example of Item 5 in Apple Inc.’s 2021 10-K.
Item 6, “Selected Financial Data,” provides a summary of certain financial information from
the past five years. As shown in Figure 7, Item 6 of Apple Inc.’s 2021 Form 10-K reports selected
financial information from 2016 to 2021. More detailed financial information for the past three
years is included in a separate section: Item 8, “Financial Statements and Supplementary Data,”
which includes the company’s balance sheet, income statement, cash flow statement, and notes to
the financial statements.
Figure 6. Excerpt from Item 5 of Apple Inc.’s 2021 Form 10-K
Item 7A, “Quantitative and Qualitative Disclosures about Market Risk,” mandates disclosure
of the company’s exposure to market risks arising from, for example, fluctuations in interest rates,
foreign currency exchanges, commodity prices, or equity prices. This section may also include
information on how the company manages these risks. Figure 8 provides an excerpt from Item 7A
in Apple Inc.’s 2021 Form 10-K.
Figure 7. Excerpt from Item 6 of Apple Inc.’s 2021 Form 10-K
Item 8, “Financial Statements and Supplementary Data,” mandates the company’s audited
financial statements, which include the company’s income statement, balance sheet, statement of
cash flows, and statement of stockholders’ equity. The financial statements are accompanied by
notes that elucidate the information presented in the financial statements. An independent
accountant audits these financial statements and, for large companies, also reports on their internal
controls over financial reporting.
Item 9, “Changes in and Disagreements with Accountants on Accounting and Financial
Disclosure,” requires companies that have changed accountants to discuss any disagreements they
had with those accountants. Such disclosure is often seen as a red flag by many investors. Item
9A, “Controls and Procedures,” discloses information about the company’s disclosure controls and
procedures, as well as its internal controls over financial reporting. Item 9B, “Other Information,”
requires companies to provide any information that should have been reported on another form
during the fourth quarter of the year covered by the 10-K but was not disclosed.
Figure 8. Excerpt from Item 7A of Apple Inc.’s 2021 Form 10-K
Part III of Form 10-K
Part III of the 10-K includes five items. Item 10, “Directors, Executive Officers and Corporate
Governance,” requires information about the background and experience of the company’s
directors and executive officers, the company’s code of ethics, and certain qualifications for
directors and committees of the board of directors. Item 11, “Executive Compensation,” requires
detailed disclosures of the company’s compensation policies and programs, as well as how much
compensation was paid to its top executive officers in the past fiscal year.
In Item 12, “Security Ownership of Certain Beneficial Owners and Management and Related
Stockholder Matters,” companies provide information about the shares owned by the company’s
directors, officers, and certain large shareholders. This item also includes information about shares
covered by equity compensation plans.
Item 13, “Certain Relationships and Related Transactions, and Director Independence,”
includes information about relationships and transactions between the company and its directors,
officers, and their family members. It also includes information about whether each director of the
company is independent.
Item 14, “Principal Accountant Fees and Services,” requires companies to disclose fees paid
to their accounting firm for various types of services during the year. Although this disclosure is
required as part of Form 10-K, most companies provide this information in a separate document
called the proxy statement. Companies distribute the proxy statement among their shareholders in
preparation for annual meetings. If the information was provided in a proxy statement, Item 14
will include a message from the company directing readers to the proxy statement document. The
proxy statement is typically filed a month or two after the 10-K. Part III of Apple Inc.’s 2021 10-K is provided in Figure 9 as an example.
Part IV of 10-K
Part IV contains Item 15, “Exhibits, Financial Statement Schedules,” which outlines the
financial statements and exhibits included as part of the 10-K filing. Many exhibits are mandatory,
including documents such as the company’s bylaws, copies of its material contracts, and a roster
of the company’s subsidiaries.
2.1.2. Data structure in annual reports
Similar to 10-Ks, annual reports are comprehensive reports detailing companies’ performance
and activities during a fiscal year. Many companies choose to incorporate graphics and
images instead of large amounts of text in their annual reports to create more visually appealing
documents than 10-Ks. For example, in Figure 10, Procter & Gamble provides both numeric and
graphic information regarding its financial performance in its 2021 annual report.
Figure 9. Part III of Apple Inc.’s 2021 Form 10-K
The structure of annual reports varies across companies, but they typically include several
common sections such as: (1) letter to shareholders, (2) performance and highlights, (3) corporate
strategies, (4) non-financial information such as environmental, social, and governance (ESG)
information, (5) financial information, (6) leadership information, and any other pertinent
information the company wishes to share.
2.2. Conventional textual analysis approach
2.2.1. Conventional Approach Review (Bag of Words)
Bag of words and LDA
The “bag of words” technique is a Natural Language Processing (NLP) technique used for
textual modeling. Text data can be messy and unstructured, posing challenges for machine learning
algorithms in analysis. These algorithms prefer structured, well-defined, fixed-length inputs. A
“bag of words” is a textual representation of the occurrence of words within a document. To create
this representation, analysts track the frequency of word occurrences in a document, disregarding
grammatical details and word order. The term “bag” is used because information about the order
or structure of words in the document is discarded, and all words are collected en masse as if in a
bag. Using this technique, variable-length texts can be converted into a fixed-length vector. The
bag-of-words approach is a simple and flexible method to extract features from documents.
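A minimal sketch of how variable-length texts become fixed-length count vectors. The helper names `tokenize` and `bag_of_words` and the two sample sentences are hypothetical; the vocabulary here is simply every word observed in the sample.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs, ignoring word order."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]


# Two hypothetical documents of different lengths.
documents = [
    "Revenue increased and margins increased.",
    "Risk factors include supply chain risk, currency risk, and litigation.",
]

# The shared vocabulary fixes the length and ordering of every vector.
vocabulary = sorted({word for doc in documents for word in tokenize(doc)})
vectors = [bag_of_words(doc, vocabulary) for doc in documents]

for vector in vectors:
    print(vector)
```

Both documents map to vectors of the same length, one slot per vocabulary word, regardless of how long each text is.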
Figure 10. Financial highlights in P&G’s 2021 annual report
Building a Keyword Dictionary
A pre-established set of keywords is required to utilize the bag-of-words approach. Sentiment
analysis, for instance, can be conducted by computing the frequency of pre-determined negative
and positive words. By comparing the number of negative words to the number of positive words, the bag-of-words approach can classify the sentiment of a text as negative or positive without the need to read the entire document.
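The word-counting step can be sketched as below. The two word sets are toy lists invented for illustration, not an established dictionary, and the net-tone score (positive minus negative, scaled by their sum) is one common choice assumed here rather than a formula given in the text.

```python
import re

# Toy word lists for illustration; a real study would load an established dictionary.
NEGATIVE = {"loss", "decline", "impairment", "litigation", "weak"}
POSITIVE = {"growth", "improve", "strong", "gain", "record"}


def sentiment(text):
    """Count dictionary words in the text and derive a simple net-tone score."""
    tokens = re.findall(r"[a-z]+", text.lower())
    neg = sum(token in NEGATIVE for token in tokens)
    pos = sum(token in POSITIVE for token in tokens)
    total = pos + neg
    score = (pos - neg) / total if total else 0.0
    label = "negative" if score < 0 else "positive" if score > 0 else "neutral"
    return {"positive": pos, "negative": neg, "score": score, "label": label}


result = sentiment(
    "Weak demand led to a loss and an impairment, despite strong growth abroad."
)
print(result)
```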
Existing keyword lists
There is a range of well-established keyword lists available for textual analyses. In sentiment
analysis, for example, the Harvard-IV-4 Dictionary is a general-purpose dictionary that provides
a list of positive and negative words developed by Harvard University. The Loughran-McDonald
Sentiment Word Lists are widely used in technical accounting and finance texts. Other researchers
have developed similar keyword lists for non-English languages (Du, Huang, Wermers, and Wu
2022) or for purposes other than sentiment analysis, such as forward-looking statements, extreme
sentiment, deception, financial constraints, uncertainty, financial performance, research and
development, technology, intangible assets, culture, big data and AI, litigation, social affiliation,
supply chain, etc. (Cao, Ma, Tucker, and Wan 2018; Hassan, Hollander, Lent, and Tahoun 2019,
etc.).
Self-defined keywords
When a suitable keyword list is not available for a specific research question, we can create a
customized one by reading a small sample of related texts and selecting the most relevant
keywords. This approach is easy to implement, but it can also be arbitrary. Below, we discuss two
structured approaches to developing self-defined keyword lists.
Corpus approach
The corpus approach begins with gathering textual contents relevant to the topic of interest,
from which a set of frequently used words {A} is extracted. This set often includes noisy keywords
unrelated to the topic. To eliminate this noise, we then identify irrelevant topics and generate a list
of frequently used words for each irrelevant topic {Bi}. A robust keyword list for the topic of
interest is then obtained by subtracting irrelevant topic keywords from the preliminary high-frequency word list, or {A} − ∪{Bi}. For example, to generate a list of political keywords {Ap}, one
might start with political science textbooks to generate a high-frequency word list {Ap0}. This
preliminary high-frequency word list might contain keywords relating to economics, law, science,
etc. To remove these irrelevant topics, we could use a similar approach to generate lists of high-frequency words for each irrelevant topic {Beconomics}, {Blaw}, {Bscience}, etc. Finally, we subtract
these irrelevant keywords from the preliminary political keyword list, resulting in a clean political
keyword list, or {Ap} = {Ap0} − ∪{Bi}. Figure 11 illustrates the process of using the corpus approach
to develop a dictionary.
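The subtraction of irrelevant-topic keywords maps directly onto set operations. In this sketch the tiny corpora, the `frequent_words` helper, and the minimum-count threshold are all hypothetical; a real application would use full textbooks and a tuned frequency cutoff.

```python
import re
from collections import Counter


def frequent_words(texts, min_count=2):
    """Return the words that occur at least min_count times across the texts."""
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z]+", text.lower())
    )
    return {word for word, n in counts.items() if n >= min_count}


# Hypothetical miniature corpora.
political_texts = [
    "the election and the senate vote on the tax law",
    "the senate passed the tax law after the election",
]
irrelevant_corpora = {
    "economics": ["the tax rate and the market", "the market and the tax base"],
    "law": ["the law and the court", "the court applied the law"],
}

# Preliminary high-frequency list {Ap0} and one list {Bi} per irrelevant topic.
A_p0 = frequent_words(political_texts)
B = {topic: frequent_words(texts) for topic, texts in irrelevant_corpora.items()}

# Clean list: {Ap} = {Ap0} minus the union of all {Bi}.
A_p = A_p0 - set().union(*B.values())
print(sorted(A_p))
```

Note that generic words such as "the" drop out as a side effect, because they are frequent in every corpus.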
Figure 11. Using the corpus approach to develop keyword lists
Dictionary expansion
The “dictionary expansion” approach generates an expanded keyword list by searching for
synonyms of key topical words in authoritative dictionaries. For instance, to create a keyword list
for “risk,” we can begin with the single word “risk” and look up all synonyms of “risk” in the
Merriam-Webster Dictionary. This may yield words like “threat” and “danger.” Subsequently, we
can look up the synonyms of these synonyms, which could provide words such as “menace,”
“jeopardy,” and “trouble.” The process can be continued until the additional synonyms are no
longer closely related to the original concept of “risk” (Figure 12).
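The repeated synonym lookup is a breadth-first expansion from a seed word. In the sketch below, `THESAURUS` is a toy stand-in for an authoritative dictionary, and a fixed number of rounds (`max_rounds`) is an assumed proxy for the judgment call of stopping when synonyms drift away from the original concept.

```python
from collections import deque

# Toy thesaurus standing in for an authoritative dictionary; entries are illustrative.
THESAURUS = {
    "risk": ["threat", "danger"],
    "threat": ["menace"],
    "danger": ["jeopardy", "trouble"],
    "trouble": ["inconvenience"],
}


def expand_keywords(seed, thesaurus, max_rounds):
    """Breadth-first expansion of a seed word through synonym lookups."""
    keywords = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        word, depth = frontier.popleft()
        if depth == max_rounds:
            continue  # do not look up synonyms beyond the allowed number of rounds
        for synonym in thesaurus.get(word, []):
            if synonym not in keywords:
                keywords.add(synonym)
                frontier.append((synonym, depth + 1))
    return keywords


risk_keywords = expand_keywords("risk", THESAURUS, max_rounds=2)
print(sorted(risk_keywords))
```

With two rounds, the list stops at "trouble" and never reaches "inconvenience", which is already far from the idea of risk.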
Figure 12. Using the dictionary expansion approach to develop keyword lists
IBM Research-Almaden developed a “human-in-the-loop” approach for AI dictionary
expansions (Alba, Gruhl, Ristoski, and Welch 2018). The approach not only discovers new
instances from the input text corpus but also predicts new “unseen” terms not currently in the
corpus. The approach runs in two phases. Continuing with the political word example, during the
explore phase, the model calculates a similarity score between words in the Merriam-Webster
Dictionary and the single word “politics” to identify instances in the dictionary that are similar to
the word “politics,” such as “activism,” “legislature,” or “government.” In the exploit phase, the
model generates new phrases based on a word’s co-occurrence score or how often words appear
together. For example, “government policy” may not appear in the Merriam-Webster Dictionary,
but “political policies” and “science of government” often appear together and can be used to build
the more complex phrase “government policy.”
2.2.2. Latent Dirichlet Allocation (LDA)
Bag of words and LDA
Latent Dirichlet Allocation (LDA) is often used for dimensionality reduction. Unsupervised
LDA proves useful in exploring unstructured text data by inferring relationships between words in
a set of documents. A common application of unsupervised LDA is topic modeling. When given
a sample of textual data and a pre-determined number of topics, K, an LDA algorithm can generate
K topics that best fit the data. Determining the appropriate number of topics is somewhat arbitrary.
The best practice is to review the textual sample to obtain a sense of the contents, generate a word
frequency table to examine the high-frequency words, and then determine the number of topics in
an informed manner. Figure 13 illustrates the keywords for the topic “politics” developed using an
unsupervised LDA algorithm.
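For readers who want the underlying model, the generative process that LDA assumes can be written compactly. The notation is the standard one from the topic-modeling literature and is not taken from this chapter: α and β are Dirichlet hyperparameters, θ_d is the topic mixture of document d, φ_k is the word distribution of topic k, z_{d,n} is the topic assigned to the n-th word of document d, and w_{d,n} is that word.

```latex
\begin{align*}
\phi_k &\sim \mathrm{Dirichlet}(\beta), & k &= 1, \dots, K \\
\theta_d &\sim \mathrm{Dirichlet}(\alpha), & d &= 1, \dots, D \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d) \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
\end{align*}
```

Fitting the model means inferring θ and φ from the observed words; the highest-probability words under each φ_k are typically reported as that topic's keywords.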
In addition to unsupervised LDA, LDA can also be supervised. Supervised LDA requires
humans to read a small sample of the textual contents and label the topics for each textual input.
The labeled sample is then used to train a model that predicts the topics of the remaining texts in
the sample. The self-supervised approach involves using an existing label to label keywords. For
instance, subsequent stock returns can be used to label positive and negative keywords when
determining positive and negative keywords in earnings announcements (Figure 14).
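The return-based labeling step can be sketched as follows. This illustrates only the labeling idea, not a topic model: the announcements, returns, the `label_keywords` helper, and the rule "label a word by the sign of the average return of the announcements containing it" are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# Hypothetical (announcement text, subsequent stock return) pairs.
announcements = [
    ("record growth in services revenue", 0.04),
    ("growth offset by impairment charges", -0.01),
    ("impairment and restructuring charges", -0.05),
    ("record revenue with strong margins", 0.03),
]


def label_keywords(samples, min_docs=2):
    """Label each word by the sign of the average return of texts that contain it."""
    returns_by_word = defaultdict(list)
    for text, ret in samples:
        # Use a set so a word counts once per announcement.
        for word in set(re.findall(r"[a-z]+", text.lower())):
            returns_by_word[word].append(ret)
    labels = {}
    for word, rets in returns_by_word.items():
        if len(rets) >= min_docs:  # skip words seen in too few announcements
            labels[word] = "positive" if mean(rets) > 0 else "negative"
    return labels


keyword_labels = label_keywords(announcements)
print(keyword_labels)
```

No human reads or labels the text here; the market reaction supplies the label.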
Figure 13. Keywords for the topic “politics” generated using unsupervised LDA
Figure 14. Using supervised LDA to develop keyword lists
2.3. Empirical examples: Analyzing corporate filings for making business decisions
An empirical example of analyzing 10-K filings
10-Ks and 10-Qs
When companies modify 10-Ks, this often provides a signal about future operations (Brown
and Tucker 2011; Cohen, Malloy, and Nguyen 2020). However, Cohen et al. (2020) document
that investors tend to neglect the valuable information embedded in the changes. Constructing a
portfolio that shorts companies making significant changes to their 10-Ks or 10-Qs while buying
those not making significant changes could yield returns of 30 to 50 basis points per month over
the subsequent year.
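The sorting logic behind such a portfolio can be sketched by scoring how much each filing changed from the prior year and then ranking firms. The text does not specify the similarity measure, so Jaccard similarity on word sets is an assumption here, as are the tickers and filing snippets.

```python
import re


def word_set(text):
    """Return the set of distinct lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def jaccard_similarity(text_a, text_b):
    """Share of distinct words that the two texts have in common."""
    a, b = word_set(text_a), word_set(text_b)
    return len(a & b) / len(a | b) if a | b else 1.0


# Hypothetical (prior-year text, current-year text) pairs keyed by ticker.
filings = {
    "AAA": ("we sell widgets in north america", "we sell widgets in north america"),
    "BBB": ("we sell widgets in north america", "we face litigation and sell fewer widgets"),
    "CCC": ("we operate retail stores", "we operate retail stores and websites"),
}

similarity = {
    ticker: jaccard_similarity(old, new) for ticker, (old, new) in filings.items()
}

# Rank from most changed (lowest similarity) to least changed.
ranked = sorted(similarity, key=similarity.get)
short_leg, long_leg = ranked[:1], ranked[-1:]
print(similarity, short_leg, long_leg)
```

In practice the legs would be quantiles of a large cross-section rather than single firms.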
Cao, Jiang, Yang, and Zhang (2023) is the first study to explore the feedback effect of technology
on corporate disclosure. They document that firms with higher machine downloads
of their SEC filings prepare 10-Ks and 10-Qs in a more machine-friendly manner that facilitates
machine parsing, scripting, and synthesizing. Furthermore, firms anticipating high machine
downloads avoid the negative words listed in Loughran and McDonald (2011) after 2011, the year
the LM dictionary was published.
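The dictionary-based sentiment behind this finding is the number of negative dictionary words in a filing divided by the total number of words. A minimal sketch follows, assuming a tiny stand-in word set: the five words in `NEGATIVE_WORDS` are illustrative only, and the actual Loughran-McDonald list is much longer and must be obtained separately.

```python
import re

# Tiny illustrative stand-in for a negative-word dictionary (an assumption).
NEGATIVE_WORDS = {"loss", "impairment", "litigation", "decline", "adverse"}


def negative_word_share(text: str, negative_words=NEGATIVE_WORDS) -> float:
    """Number of negative dictionary words divided by the total number of words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    negatives = sum(1 for token in tokens if token in negative_words)
    return negatives / len(tokens)


# Hypothetical sentence standing in for the text of a filing.
sample = "The decline in revenue led to a loss and an impairment charge."
share = negative_word_share(sample)
print(share)
```

Computing the same ratio with a second, general-purpose dictionary and taking the difference would isolate wording choices that are specific to the finance dictionary.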
Item 1 of 10-K
In 10-K Item 1, companies elaborate on their products and services. The textual descriptions
in Item 1 can be leveraged to construct a stream of measures based on product similarity. Hoberg
and Phillips (2016) extract product descriptions from 10-K Item 1 and represent word usage with
a binary vector. The cosine similarity score between a pair of companies’ product descriptions then
captures the similarity of the products between the two companies. This product similarity measure
is useful in evaluating the level of competition a company encounters. If a large number of
companies provide highly similar products or services, the given company is likely to face intense
competition in the product market. The measure can also refine industry classification, especially
as many modern companies span multiple traditional industries. For example, Amazon Inc.
operates as a retailer in the retailing industry, a streaming service provider in the entertainment
industry, an electronic device maker in the manufacturing industry, and a software provider in the
computer and business service industry. The product similarity measure provides an avenue to
define an “industry” for Amazon that consists of companies providing a similar set of products
and services rather than arbitrarily assigning Amazon Inc. to a traditional industry. In a related
vein, measuring the time-series similarity of Item 1 could assist analysts in detecting whether a
company launches new products or services and implements new strategies.
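The binary-vector cosine similarity described above can be sketched as follows. The three firms and their one-line "product descriptions" are hypothetical, and any word filtering the original study applies before building the vectors is omitted here.

```python
import math
import re


def binary_word_vector(text: str, vocabulary: list) -> list:
    """1 if the vocabulary word appears in the text, else 0."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    return [1 if word in words else 0 for word in vocabulary]


def cosine_similarity(u: list, v: list) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical product descriptions standing in for 10-K Item 1 text.
descriptions = {
    "FirmA": "online retail cloud software streaming",
    "FirmB": "cloud software consulting",
    "FirmC": "oil gas drilling",
}

# Shared vocabulary: every distinct word used by any firm.
vocabulary = sorted({w for text in descriptions.values() for w in text.split()})
vectors = {firm: binary_word_vector(text, vocabulary)
           for firm, text in descriptions.items()}

sim_ab = cosine_similarity(vectors["FirmA"], vectors["FirmB"])
sim_ac = cosine_similarity(vectors["FirmA"], vectors["FirmC"])
print(round(sim_ab, 3), round(sim_ac, 3))
```

Firms whose pairwise score exceeds a chosen threshold can be grouped as product-market peers, which is how a text-based "industry" can be defined for a company that spans several traditional industries.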
Figure 15. Frequency of Loughran and McDonald (2011) negative words
This figure plots LM – Harvard Sentiment of 10-K and 10-Q filings and compares the sentiment of the documents filed by firms with high
machine downloads with that of the low group. LM – Harvard Sentiment is the difference between LM Sentiment and Harvard Sentiment. LM
Sentiment is defined as the number of Loughran-McDonald (LM) finance-related negative words in a filing divided by the total number of words
in the filing. Harvard Sentiment is defined as the number of Harvard General Inquirer negative words in a filing divided by the total number of
words in the filing. Filings are sorted into the top or bottom tercile based on Machine Downloads. LM Sentiment and Harvard Sentiment
are each normalized to one in 2010 within each group, one year before the publication of Loughran and McDonald (2011). The
dotted lines represent the 95% confidence limits.
Source: Cao et al. (2023)
Item 1A of 10-K
In Item 1A, companies delineate risk factors impacting their business. Hanley and Hoberg
(2019) develop an emerging risks database for banks based on risk factor disclosures in Item
1A. They employ topic modeling to obtain a 25-factor Latent Dirichlet Allocation (LDA) model,
which is then used to extract 625 bigrams. Figure 16 provides an overview of the 25 emerging risk
topics and the five most prevalent words in each topic. The bigrams are then converted into a set
of interpretable risk factors in the form of word vectors using semantic vector analysis. The cosine
similarity between the vocabulary list associated with each risk theme and the raw text of a bank’s
Item 1A disclosure reflects the intensity of the bank’s discussion of each emerging risk. Using these
risk loadings, Hanley and Hoberg (2019) show that risks related to real estate, prepayment, and
commercial paper are elevated as early as mid-2005, prior to the 2008 financial crisis. They also
find that individual bank exposure to emerging risk factors strongly predicts stock returns, bank
failure, and return volatility.
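The loading of a disclosure on a risk theme can be sketched as the cosine similarity between the text's word counts and a 0/1 vector marking the theme's vocabulary. The theme names, word lists, and sample sentence below are hypothetical; the vocabularies in the cited study come from its own topic and semantic vector models, which this sketch does not reproduce.

```python
import math
import re
from collections import Counter

# Hypothetical theme vocabularies (assumptions for illustration only).
THEMES = {
    "real estate": ["mortgage", "housing", "property"],
    "funding": ["commercial", "paper", "liquidity"],
}


def theme_loading(text: str, theme_words: list) -> float:
    """Cosine similarity between the text's word counts and a 0/1 theme vector.

    Assumes theme_words contains no duplicates.
    """
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    dot = sum(counts[word] for word in theme_words)
    norm_text = math.sqrt(sum(c * c for c in counts.values()))
    norm_theme = math.sqrt(len(theme_words))
    if norm_text == 0 or norm_theme == 0:
        return 0.0
    return dot / (norm_text * norm_theme)


# Hypothetical sentence standing in for a bank's Item 1A text.
item_1a = "mortgage defaults and falling housing prices may reduce property values"
loadings = {name: theme_loading(item_1a, words) for name, words in THEMES.items()}

for name, value in sorted(loadings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<12}{value:.3f}")
```

Tracking each theme's loading across banks and years is what allows a rising risk to show up in the text before it shows up in outcomes.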
Segment Information
Recruiting CEOs whose skills and attributes align with company needs is critical for company
success. The inherent challenge in external CEO hiring arises from the heterogeneity of both job
candidates and companies. Effective recruiting involves not only identifying competent managers
but also optimizing the quality of the match between companies and CEOs. Cao, Li, and Ma (2022)
find that segment information in 10-K filings helps companies find CEOs who fit their needs. For
instance, when Ford hired Alan Mulally as CEO from Boeing’s Commercial Airplanes segment in
2006, it was seeking a leader with experience in turning around a troubled corporate giant.
Alan Mulally possessed that experience, as revealed by the segment information
disclosed in Boeing’s 1999 10-K (Figure 17). The Commercial Airplanes segment of Boeing
incurred a loss of $1,589 million in 1997 but achieved a profit of $2,016 million in the following two
years. This experience matched what Ford valued in CEO candidates.
Figure 16. Emerging risks using LDA with 25 topics
This figure provides an overview of the 25 risk factors detected using topic modeling from 10-Ks of banks for fiscal year 2006. Each box is
ranked and sized relative to its importance in the document and contains the five most prevalent words or common grams in the topic (Hanley and
Hoberg 2019).
Source: Hanley and Hoberg (2019)
Figure 17. Segment information in Boeing’s 1999 10-K
Appendix 2A: Project 1a How to crawl annual reports
Download ten 10-K filings and twenty 8-K filings of a company of your choice using the SEC
API. Randomly select and read 10 of the downloaded files to check whether the filings are
correct and complete. Write a document explaining your code.
Appendix 2B: Project 1b How to parse unstructured data
Parse Item 1, Item 1A, and Item 7 of each of the ten 10-K filings you downloaded. Read the
output and check whether the output is complete and accurate by comparing it with the original
files. Write a document explaining your code.
Appendix 2C: Solution
How to crawl annual reports
from datetime import date
import time
from collections import namedtuple
from pathlib import Path
from typing import List
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
import pandas as pd

SEARCHAPIENDPOINT = "https://efts.sec.gov/LATEST/search-index"
ARCHIVESBASEURL = "https://www.sec.gov/Archives/edgar/data"
SLEEPTIME = 0.2
MAXRETRIES = 10
DATE_FORMAT_TOKENS = "%Y-%m-%d"
AFTER_DATE = date(2000, 1, 1)
BEFORE_DATE = date.today()

retries = Retry(
    total=MAXRETRIES,
    backoff_factor=SLEEPTIME,
    status_forcelist=[403, 500, 502, 503, 504],
)

ROOT_SAVE_FOLDER_NAME = "test"
FILING_FULL_SUBMISSION_FILENAME = "full-submission.txt"
FILING_DETAILS_FILENAME_STEM = "filing-details"

FilingMetadata = namedtuple(
    "FilingMetadata",
    [
        "cik",
        "file_date",
        "period_end",
        "accession_number",
        "full_submission_url",
        "filing_details_url",
        "filing_details_filename",
    ],
)

def form_request(
    ticker: str,
    filing_types: List[str],
    start_date: str,
    end_date: str,
    start_index: int,
    query: str,
) -> dict:
    request = {
        "dateRange": "custom",
        "startdt": start_date,
        "enddt": end_date,
        "entityName": ticker,
        "forms": filing_types,
        "from": start_index,
        "q": query,
    }
    return request

def filing_metadata(hit: dict) -> FilingMetadata:
    # The hit id has the form "<accession number>:<primary document filename>".
    accession_number, filing_details_filename = hit["_id"].split(":", 1)
    cik = hit["_source"]["ciks"][-1]
    file_date = hit["_source"]["file_date"]
    period_ending = hit["_source"]["period_ending"]
    accession_number_no_dashes = accession_number.replace("-", "", 2)

    submission_base_url = (
        f"{ARCHIVESBASEURL}/{cik}/{accession_number_no_dashes}"
    )
    full_submission_url = f"{submission_base_url}/{accession_number}.txt"
    filing_details_url = f"{submission_base_url}/{filing_details_filename}"
    # Normalize ".htm" to ".html". A plain str.replace("htm", "html") would
    # also turn an existing ".html" suffix into ".htmll".
    filing_details_filename_extension = Path(filing_details_filename).suffix
    if filing_details_filename_extension == ".htm":
        filing_details_filename_extension = ".html"
    filing_details_filename = (
        f"{FILING_DETAILS_FILENAME_STEM}{filing_details_filename_extension}"
    )

    return FilingMetadata(
        cik,
        file_date,
        period_ending,
        accession_number=accession_number,
        full_submission_url=full_submission_url,
        filing_details_url=filing_details_url,
        filing_details_filename=filing_details_filename,
    )

filings_to_download: List[FilingMetadata] = []

def get_filing_urls(
    filing_type: str,
    ticker: str,
    num_filings_to_download: int,
    after_date: str,
    before_date: str,
    include_amends: bool,
    query: str = "",
) -> List[FilingMetadata]:

    start_index = 0

    client = requests.Session()
    client.mount("http://", HTTPAdapter(max_retries=retries))
    # The search endpoint is HTTPS, so the retry adapter is mounted there too.
    client.mount("https://", HTTPAdapter(max_retries=retries))

    try:
        while len(filings_to_download) < num_filings_to_download:

[The rest of get_filing_urls and everything up to the start of resolve_relative_urls is missing from this listing. The user_agent() helper called in download() is not defined anywhere in the surviving code and presumably belongs to the missing part.]

def resolve_relative_urls(filing_text, download_url) -> bytes:
    soup = BeautifulSoup(filing_text, "lxml")
    base_url = f"{download_url.rsplit('/', 1)[0]}/"

    # Rewrite relative links so they still work in the saved copy.
    for url in soup.find_all("a", href=True):
        if url["href"].startswith("#") or url["href"].startswith("http"):
            continue
        url["href"] = urljoin(base_url, url["href"])

    for image in soup.find_all("img", src=True):
        image["src"] = urljoin(base_url, image["src"])

    # Always return bytes, because download() saves the result with write_bytes().
    if soup.original_encoding is None:
        return soup.encode("utf-8")

    return soup.encode(soup.original_encoding)

def download_filings(
    download_folder: Path,
    ticker: str,
    filing_type: str,
    num_filings_to_download: int,
    after_date: str,
    before_date: str,
    include_filing_details: bool,
) -> None:
    client = requests.Session()
    client.mount("http://", HTTPAdapter(max_retries=retries))
    client.mount("https://", HTTPAdapter(max_retries=retries))
    try:
        for filing in filings_to_download:
            try:
                download(
                    client,
                    download_folder,
                    ticker,
                    filing.accession_number,
                    filing_type,
                    filing.full_submission_url,
                    FILING_FULL_SUBMISSION_FILENAME,
                )
            except requests.exceptions.HTTPError as e:
                print(
                    "Skipping full submission download for "
                    f"'{filing.accession_number}' due to network error: {e}."
                )

            if include_filing_details:
                try:
                    download(
                        client,
                        download_folder,
                        ticker,
                        filing.accession_number,
                        filing_type,
                        filing.filing_details_url,
                        filing.filing_details_filename,
                        resolve_urls=True,
                    )
                except requests.exceptions.HTTPError as e:
                    print(
                        "Skipping filing detail download for "
                        f"'{filing.accession_number}' due to network error: {e}."
                    )
    finally:
        client.close()

def download(
    client: requests.Session,
    download_folder: Path,
    ticker: str,
    accession_number: str,
    filing_type: str,
    download_url: str,
    save_filename: str,
    *,
    resolve_urls: bool = False,
) -> None:
    headers = {
        "User-Agent": user_agent(),
        "Accept-Encoding": "gzip, deflate",
        "Host": "www.sec.gov",
    }
    resp = client.get(download_url, headers=headers)
    resp.raise_for_status()
    filing_text = resp.content

    if resolve_urls and Path(save_filename).suffix == ".html":
        filing_text = resolve_relative_urls(filing_text, download_url)

    save_path = (
        download_folder
        / ROOT_SAVE_FOLDER_NAME
        / ticker
        / filing_type
        / accession_number
        / save_filename
    )
    save_path.parent.mkdir(parents=True, exist_ok=True)
    save_path.write_bytes(filing_text)

    time.sleep(SLEEPTIME)

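cwd = Path.cwd()
downloader = cwd / "C:/Users/ "
# download_filings() iterates over the module-level filings_to_download list,
# so that list must be populated (presumably by get_filing_urls) before this call.
download_filings(
    downloader,
    "AAPL",
    "10-K",
    10,
    "2012-01-01",
    "2023-01-01",
    include_filing_details=True,  # a boolean, not the string "TRUE"
)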
How to parse unstructured data
import re
import glob
from bs4 import BeautifulSoup
import csv
from pathlib import Path

files = glob.glob("C:/Users/**/*.html", recursive=True)

def parser(text, section):

    def extract_text(text, item_start, item_end):
        # Character offsets of every match of the start and end patterns.
        starts = [i.start() for i in item_start.finditer(text)]
        ends = [i.start() for i in item_end.finditer(text)]
        # Pair each start with the first end that follows it.
        positions = list()
        for s in starts:
            control = 0
            for e in ends:
                if control == 0:
                    if s < e:
                        control = 1
                        positions.append([s, e])
        # Keep the longest span; shorter ones are usually table-of-contents entries.
        item_length = 0
        item_position = list()
        for p in positions:
            if (p[1] - p[0]) > item_length:
                item_length = p[1] - p[0]
                item_position = p

        # Raises IndexError when no span was found; the callers catch it.
        item_text = text[item_position[0]:item_position[1]]

        return item_text

    if section == 1 or section == 0:
        try:
            item1_start = re.compile(r"item\s*[1][\.\;\:\-\_]*\s*\b", re.IGNORECASE)
            item1_end = re.compile(
                r"item\s*1a[\.\;\:\-\_]\s*Risk|item\s*2[\.\,\;\:\-\_]\s*Prop",
                re.IGNORECASE,
            )
            businessText = extract_text(text, item1_start, item1_end)
        except Exception:
            businessText = "Something went wrong!"

    if section == 2 or section == 0:
        try:
            item1a_start = re.compile("(?
[The listing is truncated at this point in the source.]