A rough review of Capers Jones' Applied Software Measurement: Global Analysis of Software Productivity and Quality (3rd Edition)
The book cover promises "statistics from more than 12,000 software projects." A friend remarked that they couldn't imagine anything less useful than that. Were they right? I dug in.
As promised, the book does contain statistics aggregated from over 12,000 projects, ranging from the US ATC system to small Excel macros. Statistics include familiar metrics such as budget, cost, time estimated and required for delivery, language used, environment (e.g. government, academic, corporate), and lines of code. They also include one other metric, which proves to be the raison d'être of the entire book.
What the function point?
As I assume the readers of this blog may be aware, lines of code are a pretty terrible metric, which don't especially correlate to code complexity (or anything else of use). In order to say anything about complexity in relation to cost, timeline, or other factors, Capers Jones needs to provide an alternate metric, and he sure has a clear favorite.
In the mid-1970s IBM developed a general-purpose metric for software called the function point metric. Function points are the weighted totals of five external aspects of a software project: inputs, outputs, logical files, interfaces, and inquiries. Function point metrics quickly spread through the software industry. In the early 1980s a non-profit corporation of function point users was formed: the International Function Point Users Group. This association expanded rapidly and continues to grow. As of 2004 there are IFPUG affiliates in 23 countries and the IFPUG organization is the largest software measurement association in the world.
"Inquiries" seems useful, if it means a kind of "WTFs per function." But that is not what it actually means. What it appears to mean is something like user-accessible endpoints or code paths:
Examples of inquiries include user inquiry without updating a file, help messages, and selection messages. A typical inquiry might be illustrated by an airline reservation query along the lines of "What Delta flights leave Boston for Atlanta after 3:00 p.m.?" as an input portion. The response or output portion might be something like "Flight 202 at 4:15 p.m."
"Inputs" and "outputs" are fortunately, not based on number of bytes, but different types of input/output. Without having read the IFPUG guidelines, I interpret this as saying that two different forms filled by a user, or two different HTML templates output by a server, would constitute two different inputs or outputs. (After all, if number of inputs or outputs only referred to qualitatively different types of I/O, that would be a pretty useless metric for most projects.)
Examples of inputs include data screens filled out by users, magnetic tapes or floppy disks, sensor inputs, and light-pen or mouse-based inputs. [...] Examples of outputs include output data screens, printed reports, floppy disk files, sets of checks, or printed invoices.
This seems somewhat reasonable, in that each of these aspects of a program could be said to add significant complexity. I have some misgivings about what it seems to omit, though, especially business logic. A simple Spring Boot or Rails app with a large number of controllers but very little business logic may rate as more complex on this scale than a CLI that runs molecular dynamics simulations but has only one code path. (Of course, it's quite possible that, as someone who isn't trained to estimate function points, I'm completely wrong about that example.)
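To make that example concrete, here's a minimal sketch of an unadjusted, IFPUG-style count for those two hypothetical apps. The component weights are the commonly published IFPUG "average complexity" values, and the component tallies are invented for illustration; a certified counter would doubtless classify things differently.

```python
# Hedged sketch: an unadjusted, IFPUG-style function point total is a weighted
# sum over five component types. Weights are the published "average complexity"
# values; all component counts below are made up for illustration.
WEIGHTS = {
    "inputs": 4,          # external inputs
    "outputs": 5,         # external outputs
    "inquiries": 4,       # external inquiries
    "logical_files": 10,  # internal logical files
    "interfaces": 7,      # external interface files
}

def unadjusted_function_points(components: dict) -> int:
    """Sum each counted component times its weight."""
    return sum(WEIGHTS[kind] * count for kind, count in components.items())

# CRUD-heavy web app: many screens and endpoints, very little business logic.
crud_app = {"inputs": 20, "outputs": 20, "inquiries": 15, "logical_files": 10, "interfaces": 2}

# Molecular-dynamics CLI: one input deck, one results file, very deep logic.
simulation_cli = {"inputs": 1, "outputs": 1, "inquiries": 0, "logical_files": 2, "interfaces": 0}

print(unadjusted_function_points(crud_app))        # 354 -- rates as "complex"
print(unadjusted_function_points(simulation_cli))  # 29  -- rates as "simple"
```

By this arithmetic the CRUD app comes out more than ten times as "complex" as the simulation code, which is exactly the omission that bothers me.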
That example illuminates my larger issue with function points as a concept, which is that complexity of a software system is very difficult to pin down or enumerate. It involves a large number of judgements that range from objective to mostly subjective.
Complexity may even depend on factors such as the current milieu or ferment of software engineering or the current labor market. (As an example, consider a hypothetical alternate universe where LISP rather than C had been the predominant template for the most successful languages, or where scientific programming was much more common than web development. How would these factors change how easy or difficult an app is to develop and maintain, and thus how complex its development might be?)
Despite those shortcomings, I think that if function points could be easily measured and estimated, then they could provide a useful stopgap proxy for complexity. However...
You can't even accurately automate function point measurements.
There has been a long-standing problem with using function point metrics. Manual counting by certified experts is fairly expensive. Assuming that the average daily fee for hiring a certified function point counter in 2008 is $2,500 and that manual counting using the IFPUG function point method proceeds at a rate of about 400 function points per day, the result is that manual counting costs about $6.25 per function point. Both the costs and comparatively slow speed has been an economic barrier to the widespread adoption of functional metrics. These statements are true for IFPUG, COSMIC, Mark II, NESMA, and other major forms of function point metrics.
There are tools that attempt to measure function points. As Capers Jones enumerates, in a paragraph that mostly looks like a graveyard of programs that time forgot:
Software estimating and measurement tools that support function points as explicit metrics are fairly common. The SPQR/20 estimating tool, on the market in 1984, was the first such commercial tool to provide explicit support for function points as well as for LOC and KLOC metrics. Other estimating and measurement tools that support functional metrics of various flavors include (in alphabetic order): Asset-R, the Bridge, BYL [Before You Leap], CHECKMARK (in the United Kingdom), CHECKPOINT (in the United States and Europe), COCOMO II, ESTIMACS, GECOMO, KnowledgePlan, MARS, PADS, Price-S, SEER, and SLIM.
(I looked for each of these online, and included links to every single one that I could find a primary source for. Some of these primary sources don't even provide anywhere to purchase or download the software.)
While each of these tools supports (or supported, back in 1997 when they were still alive) function points as an output, Capers Jones points out that automated measures of function points are very inaccurate. In the following table, he compares the speed, cost, and accuracy of measuring function points through various methods. "Pattern matching" and "backfiring from LOC" are the commercially available automated methods, and "accuracy of count" indicates the error margin (i.e., higher values are worse). "Automatic derivation" in this table specifically refers to "tools [that] have been built experimentally, but are not commercially available," and which "require formal requirements and/or design documents such as those using structured design, use cases, and other standard methods. This method has good accuracy, but its speed is linked to the rate at which the requirements are created. It can go no faster."
Method of Counting | Function Points Counted per Day | Daily Compensation | Cost per Function Point | Accuracy of Count (error margin) |
---|---|---|---|---|
Agile story points | 50 | $2,500 | $50.00 | 5% |
Use case manual counting | 250 | $2,500 | $10.00 | 3% |
Mark II manual counting | 350 | $2,500 | $7.14 | 3% |
IFPUG manual counting | 400 | $2,500 | $6.25 | 3% |
NESMA manual counting | 450 | $2,500 | $5.55 | 3% |
COSMIC manual counting | 500 | $2,500 | $5.00 | 3% |
Automatic derivation | 1,000 | $2,500 | $2.50 | 5% |
"Light" function points | 1,500 | $2,500 | $1.67 | 10% |
NESMA "indicative" counts | 1,500 | $2,500 | $1.67 | 10% |
Backfiring from LOC | 10,000 | $2,500 | $0.25 | 50% |
Pattern-matching | 300,000 | $2,500 | $0.01 | 15% |
Based on the values in that table, accuracy of function point values obtained by commercially available automated measurement tools ranges from bad (15% inaccurate) to very bad (50% inaccurate). The other methods, which require trained professionals to count points manually, are slow and expensive. This is a pretty major problem for using function points for anything other than research.
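For concreteness, the table's cost column is just the daily fee divided by the counting speed, and the "slow and expensive" part becomes obvious once you apply it to a system of any real size (the 10,000-function-point figure below is my own hypothetical):

```python
# The "cost per function point" column is daily fee / daily counting speed.
daily_fee = 2_500  # the book's assumed daily rate for a certified counter

for method, fp_per_day in [("IFPUG manual counting", 400), ("Pattern-matching", 300_000)]:
    print(method, round(daily_fee / fp_per_day, 2))   # 6.25 and 0.01

# What manual IFPUG counting means for a hypothetical 10,000-FP system:
system_size_fp = 10_000
print(system_size_fp / 400)            # 25 person-days just to measure it
print(system_size_fp * (2_500 / 400))  # $62,500 in counting fees
```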
It is possible that automated tooling has gotten a lot better at counting function points since 1997. In fact, I'd say it's not just possible, but extremely probable. However, the IFPUG still trains and certifies manual function point counters, and does not seem to have any recommended automated alternatives.
But wait, what are all these different acronyms and types of function point counting in that table?
A rat king of competing standards
Jones actually lists no fewer than 38 different function point standards.
- The 1975 internal IBM function point method
- The 1979 published Albrecht IBM function point method
- The 1982 DeMarco bang function point method
- The 1983 Rubin/ESTIMACS function point method
- The 1983 British Mark II function point method (Symons)
- The 1984 revised IBM function point method
- The 1985 SPR function point method using three adjustment factors
- The 1985 SPR backfire function point method
- The 1986 SPR feature point method for real-time software
- The 1994 SPR approximation function point method
- The 1997 SPR analogy-based function point method
- The 1997 SPR taxonomy-based function point method
- The 1986 IFPUG Version 1 method
- The 1988 IFPUG Version 2 method
- The 1990 IFPUG Version 3 method
- The 1995 IFPUG Version 4 method
- The 1989 Texas Instruments IEF function point method
- The 1992 Reifer coupling of function points and Halstead metrics
- The 1992 ViaSoft backfire function point method
- The 1993 Gartner Group backfire function point method
- The 1994 Boeing 3D function point method
- The 1994 Object Point function point method
- The 1994 Bachman Analyst function point method
- The 1995 Compass Group backfire function point method
- The 1995 Air Force engineering function point method
- The 1995 Oracle function point method
- The 1995 NESMA function point method
- The 1995 ASMA function point method
- The 1995 Finnish function point method
- The 1996 CRIM micro-function point method
- The 1996 object point method
- The 1997 data point method for database sizing
- The 1997 Nokia function point approach for telecommunications software
- The 1997 full function point approach for real-time software
- The 1997 ISO working group rules for functional sizing
- The 1998 COSMIC function point approach
- The 1999 Story point method
- The 2003 Use Case point method
How did this come about?
In 1986, the International Function Point Users Group (IFPUG) was created and today includes more than 400 member companies in the United States alone, plus members in Europe, Australia, South America, and the Pacific Rim. The IFPUG counting practices committee is the de facto standards organization for function point counting methods in the United States. The IFPUG counting practices manual is the canonical reference manual for function point counting techniques in the United States and much of the world. However, for various reasons, the success of function point metrics spawned the creation of many minor variations. As of 2008, the author has identified about 38 of these variants, but this may not be the full list. No doubt at least another 20 variations exist that either have not been published or not published in journals yet reviewed by the author. This means that the same application can appear to have very different sizes, based on whether the function point totals follow the IFPUG counting rules, the British Mark II counting rules, COSMIC function point counting rules, object-point counting rules, the SPR feature point counting rules, the Boeing 3D counting rules, or any of the other function point variants. Thus, application sizing and cost estimating based on function point metrics must also identify the rules and definitions of the specific form of function point being utilized.
As is common practice, the creation of one attempted standard begat many more competing standards.
In spite of the considerable success of function point metrics in improving software quality and economic research, there are a number of important topics that still cannot be measured well or even measured at all in some cases. Here are some areas where the concept of function points need to be expanded to solve other measurement problems. Indeed an entire family of functional metrics can be envisioned.
It is hard for me not to read the last sentence of this paragraph with terrible dread.
Story points and function points
Circling back to our massive list of function point standards, you may notice something interesting: what the hell are story points doing there?
Jones is aware that Agile methodology estimates something about work in terms of "story points." The similarity between the phrases "function point" and "story point" gives him the false impression that they are counting the same kind of thing.
There are some alternate methods for deriving function point counts that are less expensive, although perhaps at the cost of reduced accuracy. [...] There are broad ranges in counting speeds and also in daily costs for every function point variation. [...E]ach "Agile story point" reflects a larger unit of work than a normal function point. A story point may be equal to at least 2 and perhaps more function points. A full Agile "story" may top 20 function points in size. The term Agile story points refers to a metric derived from Agile stories, the method used by some Agile projects for deriving requirements. An Agile story point is usually somewhat larger than a function point, and is perhaps roughly equivalent to two IFPUG function points, or perhaps even more.
Jones even believes there should be a formal rule to convert story points to function points and vice versa, and that agilists should be responsible for producing this rule.
Since several forms of function point metrics could be used with Agile projects, it would be useful to produce formal conversion rules between IFPUG, COSMIC, and NESMA function points and equivalent story points. Doing this would require formal function point counts of a sample of Agile stories, together with counts of story points. Ideally, this task should be performed by the Agile Alliance, on the general grounds that it is the responsibility of the developers of new metrics to provide conversion rules to older metrics.
One of the core flaws of this idea is that function points measure a property of the software that has been developed, whereas story points measure a property of the process building it. (E.g., if a complex function is built and then removed, it would be worth a total of 0 function points in the final product. However, the process of building it and the process of ripping it out are each worth a non-zero number of story points, which do not cancel each other out.) Furthermore, story points are used primarily to estimate what's going on in a project in a small window of the last, current, and next iterations. I'm used to 1-week iterations, so this window would be a total of 3 weeks—pretty far from the total overview of a codebase that function points attempt to provide. Lastly, I think that if you're trying to optimize any metric derived from story points, something has gone horribly wrong, and you need to reread the Agile manifesto.
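A contrived ledger, with entirely made-up numbers, shows why no fixed conversion ratio can bridge the two:

```python
# Story points measure work performed per iteration; function points measure
# what exists in the delivered product. All numbers here are invented.
iterations = [
    # (work item,               story points, change in delivered function points)
    ("build reporting module",  13,           +25),
    ("harden reporting module",  8,             0),  # real work, no new function
    ("remove reporting module",  5,           -25),  # ripping it out is also work
]

story_points_spent = sum(sp for _, sp, _ in iterations)
net_function_points = sum(fp for _, _, fp in iterations)

print(story_points_spent)    # 26 story points of effort expended
print(net_function_points)   # 0 net function points in the final product
```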
Contra the Agile manifesto, Jones seems to argue that agilists should put greater value on processes and tools, especially when it comes to measuring project output. Yet he still has to admit that Agile works.
One of the weaknesses of the Agile approach is the widespread failure to measure projects using standard metrics such as function points. There are some Agile measures and estimating methods, but the metrics used such as "story points" have no available benchmarks. Lack of Agile measures include both productivity and quality. However, independent studies using function point metrics do indicate substantial productivity benefits from the Agile approach.
Science envy
Capers Jones started his career as editor of a medical journal. This has a clear impact on his worldview.
When examining the literature of other scientific disciplines such as medicine, physics, or chemistry, about half of the page space in journal articles is normally devoted to a discussion of how the measurements were taken. The remaining space discusses the conclusions that were reached. The software engineering literature is not so rigorous, and often contains no information at all on the measurement methods, activities included, or anything else that might allow the findings to be replicated by other researchers. The situation is so bad for software that some articles even in refereed journals do not explain the following basic factors of software measurement:
- Which activities are included in the results; i.e., whether all work was included or only selected activities
- The assumptions used for work periods such as work months
- The programming language or languages used for the application
- Whether size data using lines of code (LOC) is based on:
  - Counts of physical lines
  - Counts of logical statements
- Whether size data using function points is based on:
  - IFPUG function point rules (and which version)
  - Mark II function point rules (and which version)
  - Other rules such as Boeing, SPR, DeMarco, IBM, or something else
- How the schedules were determined; i.e., what constituted the "start" of the project and what constituted the "end" of the project.

It is professionally embarrassing for the software community to fail to identify such basic factors.
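For what it's worth, the factors Jones lists aren't hard to operationalize; a hypothetical record attached to a single published measurement might look something like this (the field names and values are mine, not from any standard):

```python
# Hypothetical metadata record for one reported measurement, covering the
# factors Jones says refereed papers routinely omit. Entirely illustrative.
from dataclasses import dataclass

@dataclass
class MeasurementContext:
    activities_included: list        # e.g. ["requirements", "coding", "testing"]
    hours_per_work_month: float      # work-period assumption
    languages: list                  # programming languages used
    loc_counting_rule: str           # "physical lines" or "logical statements"
    function_point_rule: str         # e.g. "IFPUG Version 4", "Mark II", "COSMIC"
    schedule_start: str              # what counted as the project's start
    schedule_end: str                # what counted as the project's end

ctx = MeasurementContext(
    activities_included=["requirements", "design", "coding", "testing"],
    hours_per_work_month=132.0,
    languages=["Java"],
    loc_counting_rule="logical statements",
    function_point_rule="IFPUG Version 4",
    schedule_start="requirements kickoff",
    schedule_end="first production release",
)
```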
I agree that this is embarrassing for the authors and reviewers of peer-reviewed publications on software development (in settings where this is relevant—I don't think that e.g. theoretical researchers in distributed systems need to be worried about this). I don't think it's especially relevant to the world where almost all software development goes on. Or as friend of the blog Nat Bennett put it, "You are professionally embarrassed by my lack of measurement rigor. I ship software. We are not the same."
The author was formerly an editor of research papers and monographs in physics, chemistry, and medicine as well as in software. Unfortunately, most software managers and software engineers, including referees and journal editors, appear to have had no training at all in the basic principles of scientific writing and the design of experiments. The undergraduate curricula of physicists, physicians, and chemists often include at least a short course on scientific publication, but this seems to be omitted from software engineering curricula.
Physicists and chemists (even to some extent those working in industries such as pharmaceuticals) are scientists, not engineers. The role and goal of a scientist is to discover novel information or to confirm existing knowledge. The role and goal of a software engineer is to produce working software. Creating and maintaining working software requires different skills, goals, and processes than discovering new facts about the world. In order to discover new knowledge, you need to be rigorous and precise about what it is you're measuring and how you're measuring it, and to communicate that knowledge to fellow scientists you need to painstakingly document the how, what, when, and where of your measurements. As anyone who has shipped software can attest: you do not need to do this to ship software.
Can an electrical engineering student matriculate through a normal curriculum without learning how to use an oscilloscope and volt-ohmmeter, or without learning the derivation of amps, ohms, joules, volts, henrys, and other standard metrics? Certainly not. Could a physician successfully pass through medical school without understanding the measurements associated with blood pressure, triglyceride levels, or even chromatographic analysis? Indeed not. Could a software engineer pass through a normal software engineering curriculum without learning about functional metrics, complexity metrics, or quality metrics? Absolutely. (While this chapter was being prepared, the author was speaking at a conference in San Jose, California. After the session, a member of the audience approached and said, "I'm a graduate software engineering student at Stanford University, and I've never heard of function points. Can you tell me what they are?")
Interestingly, the conclusion of Hillel Wayne's project, which interviewed many engineers who had switched from traditional forms of engineering (chemical, nuclear, electrical, mechanical, etc.) to software engineering, is that software engineers are really engineers. Has that much changed between 2008 and now, or did Hillel's informants pick up on something that Capers, a former journal editor and not a scientist or traditional engineer himself, did not?
Electrical engineering is not my line, so I can't speculate about whether or not you could do electrical engineering without using oscilloscopes and voltmeters, or without using any of the units that Capers mentions (or their imperial or cgs equivalents). Doing medicine without ever taking blood pressure or triglyceride levels, and without using any lab values derived from chromatography, seems to be in the realm of "possible but unnecessarily hard," like one of those YouTube videos where someone beats Skyrim without ever hitting an enemy. Also, it would probably get you in trouble with the board, and targeted by several or more malpractice suits.
This is a big contrast from doing software engineering without using metrics, a thing which is possible and done every day at scales both large and small and qualities both shitty and amazing.
Now, there is an argument to be made that just because software engineers go without those metrics doesn't mean that they should—that our current state of software engineering is like an era of medicine before lab levels or blood pressure measurements. I think this is the argument that Capers is trying to make.
A big part of the reason why I disagree with him on this is that most of these metrics don't really have to do with the software itself—they're about project management. You could use the same kind of metrics to measure the process of building a parking garage or designing a new chemical plant process—in fact, I think you could even use story points unmodified for this, if you wanted to be a pioneer in agilizing things that never asked to be agiled. Function points, Capers' pet metric, are (to his credit) actually a measure of something about software, but that "something" is of interest to management, not to the people actually building the system.
As a contrast, I talked to a systems engineer of my acquaintance about some of the values that she gathers when building a system of interest, such as a software-defined radio. These include measurements of how well the antenna performs—both emitting and receiving—at various distances, how the radio unit itself responds to high vibration, and other such physical performance measures. They notably don't include estimates of how long it would take for this radio to be produced at scale, or how much that would cost. Although these problems do affect the resources available to an engineer (e.g., you can't make your radio antenna out of pure unobtainium), they are fundamentally problems of logistics and project management, not engineering.
What Capers gets right
On zero-defect systems
The concepts of zero defects originated in aviation and defense companies in the late 1950s. Halpin wrote an excellent tutorial on the method in the mid-1960s. The approach, valid from a psychological viewpoint, is that if each worker and manager individually strives for excellence and zero defects, the final product has a good chance of achieving zero defects. The concept is normally supported by substantial public relations work in companies that adopt the philosophy. Interestingly, some zero-defect software applications of comparatively small size have been developed. Unfortunately, however, most software practitioners who live with software daily tend to doubt that the method can actually work for large or complex systems. There is certainly no empirical evidence that zero-defect large systems have yet been built.
I am also unable to find empirical evidence that zero-defect large systems have yet been built. I think this is a good point to make.
On Agile
Despite his failure to understand either story points or the importance Agile places on "individuals and interactions over processes and tools," Jones picks up on two real potential flaws of Agile: a high rate of scope creep, and the fact that a lot of Agile shops depend on the intense, driven character of the people working there.
Domain | Average Monthly Rate of Creeping Requirements |
---|---|
End-user software | 0.5% |
Management information systems | 1.0% |
U.S. outsource software | 1.0% |
Commercial software | 3.5% |
Systems software | 2.0% |
Military software | 2.0% |
Web software | 5.0% |
Agile software | 10.0% |
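To put the Agile row in perspective, here's the back-of-the-envelope growth over a year, under my assumption that the monthly rate applies to the original requirements size rather than compounding (the starting project size is also invented):

```python
# Requirements growth over a 12-month project, reading the table's monthly
# creep rate as a simple (non-compounding) percentage of the original size.
original_size_fp = 1_000   # hypothetical starting size in function points
months = 12

for domain, monthly_rate in [("Web software", 0.05), ("Agile software", 0.10)]:
    final_size = original_size_fp * (1 + monthly_rate * months)
    print(domain, final_size)   # Web: 1600.0, Agile: 2200.0
```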
The Agile approach seems to attract a young and very energetic set of developers, who bring with them a rather intense work ethic. A part of the productivity improvements from the Agile methods are based on the technologies themselves, and part are based on old-fashioned hard work and a lot of unpaid overtime.
On schedules
Jones correctly points out that schedule pressure can have an extremely destructive effect on a project, and that the people who quit because of bad management are likely to include the strongest engineers.
Of all of the factors that management can influence, and that management tends to be influenced by, schedule pressures stand out as the most significant. Some schedule pressure can actually benefit morale, but excessive or irrational schedules are probably the single most destructive influence in all of software. Not only do irrational schedules tend to kill the projects, but they cause extraordinarily high voluntary turnover among staff members. Even worse, the turnover tends to be greatest among the most capable personnel with the highest appraisals. Figure 4-6 illustrates the results of schedule pressure on staff morale.
On measuring lines of code (LOC)
Capers is, to his credit, very vehemently against using lines of source code as a metric for anything, and goes into great depth as to why this is a bad idea.
The subjectivity of "lines of source code" can be amusingly illustrated by the following analogy: Ask an obvious question such as, "Is the speed of light the same in the United States, Germany, and Japan?" Obviously the speed of light is the same in every country. Then ask the following question: "Is a line of source code the same in the United States, Germany, and Japan?" The answer to this question is, "No, it is not"—software articles and research in Germany have tended to use physical lines more often than logical statements, whereas the reverse is true for the United States and Japan.
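If the physical-versus-logical distinction sounds academic, here's a toy snippet where the two counting rules disagree by a factor of several (the counts are mine):

```python
total = 0; count = 0        # 1 physical line, 2 logical statements

squares = sum(              # 4 physical lines,
    x * x                   # but just
    for x in range(10)      # 1 logical
)                           # statement

# Physical-line counting says the second chunk is four times the size of the
# first; logical-statement counting says it's half the size. Same program.
```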
I ragged on him earlier for a lot of things he had to say about what software engineering students do and don't learn, but this is a reasonable thing to point out:
There is a simple test for revealing inadequate metrics training at the university level. Ask a recent software engineering graduate from any U.S. university this question: "Did you learn that the 'lines of source code' metric tends to work backward when used with high-level languages?" Try this question on 100 graduates and you will probably receive about 98 "No" answers. It is embarrassing that graduate software engineers can enter the U.S. workforce without even knowing that the most widely utilized metric in software's history is seriously flawed.
I don't think this would reveal inadequate metrics training per se, but certainly inadequate training in dealing with management-driven bullshit.
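For anyone who hasn't seen the paradox spelled out: because a higher-level language delivers the same functionality in fewer lines, the cheaper, faster project ends up with the worse cost-per-LOC number. A made-up example, with every figure hypothetical:

```python
# Same feature set built twice; the high-level version is cheaper and faster,
# yet looks worse by cost per line of code. All numbers are invented.
monthly_cost = 10_000   # assumed fully burdened cost per person-month

projects = {
    #                      (lines of code, person-months)
    "low-level language":  (32_000, 20),
    "high-level language": ( 5_000,  8),
}

for name, (loc, months) in projects.items():
    total_cost = months * monthly_cost
    print(f"{name}: total ${total_cost:,}, ${total_cost / loc:.2f} per LOC")

# low-level language:  total $200,000, $6.25 per LOC
# high-level language: total $80,000, $16.00 per LOC  <- "worse", despite winning
```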
And finally, maybe the most cogent thing in the book:
Malpractice is a serious situation and implies the usage of an approach known to be harmful under certain conditions, which should have been avoided through normal professional diligence. For example, a medical doctor who prescribed penicillin for a patient known to be allergic to that antibiotic is an illustration of professional malpractice. Using LOC and KLOC metrics to evaluate languages of different levels without cautioning about the paradoxical results that occur is unfortunately also an example of professional malpractice.

I first thought this paragraph was a gross exaggeration, but after reflection, I think Capers was 100% right on this one. It might not kill people (usually) but it could definitely kill a project, or team morale.