Minutes of Weekly Meeting, 2009-04-20
Meeting called to order at 10:34 AM EDT
1. Roll Call
Eric Cormack
Ian McIntosh
Brian Erickson
Adam Ley (left 11:07)
Brad Van Treuren
Tim Pender
Carl Walker
Heiko Ehrenberg
Excused:
Patrick Au
2. Review and approve previous minutes:
4/13/2009 minutes:
- Draft circulated on 6th April:
- Corrections:
- In Review of Actions:
- Delete " - Discussed in topic 4b." from 7th action
- Change "as discussed today" to "as discussed Apr 6th" in 16th action
- Change "as discussed today" to "as discussed Apr 6th" in 17th action
- In Discussion Topic 4a:
- Change "[Brad] Yes, it seem very different ..." to "[Brad] Yes, it seems
very different ..."
- Brad moved to approve with the above amendments, Heiko seconded, no
objections.
3. Review old action items
- Adam proposed we cover the following at the next meeting:
- Establish consensus on goals and constraints
- What are we trying to achieve?
- What restrictions are we faced with?
- Establish whether TRST needs to be addressed as requirements in the ATCA
specification if it is not going to be managed globally (All)
- Adam review ATCA standard document for FRU's states
- Patrick contact Cadence for EDA support person.
- All to consider what data items are missing from Data Elements diagram
- All: do we feel SJTAG is requiring a new test language to obtain the
information needed for diagnostics or is STAPL/SVF sufficient?
see also Gunnar's presentation, in particular the new information he'd be
looking for in a test language
(http://files.sjtag.org/Ericsson-Nov2006/STAPL-Ideas.pdf)
- Carl W/Andrew: Set up conference call to organise review of Vol. 3 - Ongoing
- Andrew: Make contact with VXI Consortium/Charles Greenberg. - Ongoing
- Ian/Brad: Draft "straw man" Volume 4 for review - Ongoing
- All: Review "Role of Languages" in White Paper Volume 4 - Ongoing
- All: Consider structure/content of survey - Ongoing
- Harrison: Virtual system exploiting Configuration/Tuning/Instrumentation and
Root Cause Analysis/Failure Mode Analysis Use Cases. - Ongoing
- Brad: Virtual system exploiting POST and BIST Use Cases. - Ongoing
- Ian: Virtual system exploiting Environmental Stress Test Use Cases. - Ongoing
- Brad/Ian: Prepare draft survey for review by group. - Ongoing
4. Discussion Topics
- Group Focus
- [Ian] Some of our recent discussions have maybe been going a bit deep and not
making the progress we'd like, so I thought we should step back a bit and use
the results from last year's survey to look at what others thought were the
important objectives for SJTAG.
- [Ian] Referring to Priority Objectives in the results summary on the web site
(http://www.sjtag.org/misc/results_aug08.html),
there are seven headings that show as being of greatest importance. Two of those,
'Common Test Language' and 'Defined Data Formats', are subjects we've struggled
with a bit recently, so should we set those aside for now?
- [Eric] Yes, I think so.
- [Ian] System Diagnostics is then the next biggest topic. Can we consider what
that means for SJTAG?
- [Brad] I think 2 (System Diagnostics) and 3 (Reuse of Board Test Vectors) are
maybe related. Part of System Diagnostics is board diagnostics and that can
mean reusing board-level tests.
- [Brad] Look at Gunnar's system-level control of board-level tests, where the
tests are embedded in the board and you can say to a board "Go test yourself
and report what you find".
- [Brad] Compare that with the multidrop architecture where a Shelf Controller
is in charge of running tests that have been repurposed from the board-level.
In system JTAG, how do these different architectures impact System
Diagnostics? e.g. how do you report back which board is failing?
- [Ian] I think that the purely external test case is very similar to the
multidrop scenario you describe, Brad.
- [Brad] In the external case you have the benefit of access to the whole
tool set, but the principle is essentially similar.
- [Brad] Thinking back to the slides I presented a few weeks back (23rd Feb.)
where I described the typical functional test model with diagnostic software
interacting with diagnostics agents on each board: Gunnar's approach maps
quite neatly onto that, whereas the Shelf Controller maybe doesn't follow the
functional test model in terms of locality of control.
- [Tim] If you're looking at the Agent's perspective, it could be a set of
register values being returned for diagnostic evaluation. It then depends on
how you're decoding the data in those registers.
- [Brad] It's interesting that you talk about diagnostics based on a register
set. Most tooling looks at the super-vector for the whole chain. Is there
some way we can leverage registers for our use?
- [Ian] Brad, is this similar to the discussion you had with Paul Holowko of BAE
Systems at ITC? I recall some discussion of fault codes and failure groups.
- [Brad] That's where I struggle; one of the issues we have to deal with is
faults in the field that are then No Fault Found when the FRU gets to the
repair station. In those cases, having a snapshot of all the data may be very
important.
- [Tim] It could be a connection on the backplane; you change the FRU and re-
make the backplane connection so it works. I don't think you can rely on the
agent to have all the diagnostics.
- [Ian] I think that is what Brad was saying: in the case of an external
controller, extra diagnostic tools can be made available; but what about the
case of embedded controllers?
- [Brad] Right; external tooling should be able to partition the result pattern to
identify Boundary Scan register cells, pins, and nets involved in a
particular fault.
- [Brad] If the same error occurs on multiple boards, is that due to
environmental issues, is it due to a problem on the backplane, etc? Having a
recorded response pattern (from the system being employed in the field) will
be helpful in determining the root cause for certain faults.
- [Tim] Maybe those environmental tests are the cases when you would use an
external control for the tests. Then you can have all the diagnostics tools
available.
- [Ian] I think Brad is referring to environmental effects in the field rather
than a controlled environmental test.
- [Brad] Correct. For example, we have seen (and identified) problems related to
high humidity that haven't been encountered before in controlled test
environments. Ian, I guess you could have similar problems on radars if there's
a broken seal somewhere?
- [Ian] Yes, and I've seen space applications where poor processing results in
outgassing at connectors, causing opens.
- [Brad] Perhaps SJTAG should define some sort of SJTAG interface (maybe through
Ethernet or something that is commonly used) and defined data formats that
would allow external equipment to access internal system functions/data.
- [Ian] I just wonder if that might end up trespassing on some existing IP that
has been produced for similar functions.
- [Brad] If you'd define a software interface, a messaging protocol, you'd not
need a separate hardware interface, for example.
- [Tim] USB is basically a two-wire interface and it seems that it might suit
converting to 1149.7.
- {Adam's connection was dropped at this point}
- [Brad] Security would be an issue, especially if it is a shared interface also
used for system functions.
- [Brad] Something else I'd like to discuss is the difference in diagnostics
between interconnect tests and other operations like programming or
instrumentation.
- [Ian] Are you looking at the type of data or the format?
- [Brad] I'm looking at it from the test flow perspective.
- [Brad] There are primitives that are the same across Use Cases. At some point
we separate out to Use Case specifics. Where is that point? Is there
commonality in the core of diagnostic results for the different use cases? Or
is the split too early?
- [Ian] I'd say it is: Tooling for device programming typically won't give you
much diagnostic other than 'this vector failed'. But maybe I'm taking too
narrow a view here - thinking about it, this is probably really a case that most
tooling is aimed at a specific Use Case, so while there may be diagnostic data
in the return vectors of a failed programming operation it may be getting
discarded.
- [Brad] Yes, I think that's what I see too. Fault analysis is the Utopia, but
we live with the fact that we just report that the pin state isn't what was
expected.
- [Brad] Instrumentation and programming, for example, can be compared in a
sense that both write to registers and results are read back from registers.
Interconnect tests can be looked at in a similar way. However, tools are
written for specific Use Cases; perhaps you can't tie them together.
- [Ian] It may be more historical than technical. Maybe the tool vendors have
more of a view on this?
- [Brad] Are we misrepresenting the position here? I'd be keen to hear what the
tool vendors have to say.
- [Heiko] I kind of agree. There are certain Use Cases that could be more
informative. But look at something like Flash programming. You could run all
the structural tests and cluster tests and find nothing wrong, but still get
a programming fail because a pin on the Flash isn't connected. What you get
may not represent the right type of error.
- [Brad] I'd agree, but are there some common data types or operations? Are
these coming down to some set of registers?
- [Brian] Sounds like interpretation of the resulting data. There seem to be two
issues: How to store result data in an onboard/embedded controller, and how to
analyze that data, e.g. you could report back which bits are failing.
I think we're more used to dealing with the primitives; trying to deal with
things at this level could be difficult.
- [Brad] Good point, we're more interested in where it fails.
- [Brian] You could say 'This location failed' and which vector it failed in.
- [Brad] There was an IEEE standard about representation of diagnostic data.
Teradyne were involved in it.
- [Ian] Possibly IEEE 1445, Standard for Digital Test Interchange Format [DTIF]?
- [Brad] That sounds like the right one. I'm not saying we need to follow it, but
it may give us some pointers on what we need to think about.
- [Adam, proxy] Concerning the IEEE standard about representation of diagnostic
data, I presume that "STDF" (Standard Test Data Format) is what's intended
(ref http://stdf.nanoisi.com)
- [Brian] We have to be careful about possibly over-specifying. We would want to
make sure not to be too restrictive, so we don't miss out on some failure
information that may not fit the fault types in our definitions.
- [Brad] I think there's some value in defining some elements in a persistent
form.
- [Brian] The concern is also that the results are variable in size. Consider a
catastrophic failure, let's say a clock signal that is stuck, so every other
vector might fail, resulting in a lot of diagnostic data.
- [Brad] Typically we have an upper bound; if we exceed that then some alternate
strategy has to be used.
- [Tim] Also, we need to think about test flow control: Do you stop on fail or
continue? If you fail to configure LVDS pins would you want to continue if
there's a risk of damage? Under which circumstances do we want to continue a
test vs. skip a test vs. execute a specific test? Can we include some test
flow control information in the diagnostic pattern?
- [Ian] This is similar to the issue I raised when we talked about EST: Critical
failures vs. noncritical failures. Criticality isn't something that JTAG
really has a concept of.
- [Brian] Probably have to have some flag to indicate 'stop on first fail'.
- [Brad] We're able to provide an adornment in a layer above the SVF or STAPL to
help with this. And there are many cases where you do want to carry on and
complete all the tests.
- [Heiko] A simple example is checking the scan chain; there's little point in
carrying on if that fails.
- [Ian] I think we're going to have to continue this discussion; I think there's
more to come out of this yet.
- [Eric] Oh yes, we haven't really addressed multilayered diagnostics yet.
- [Brad] Also, we now see systems with a lot of redundancy, so I can bypass a
fault to keep running, maybe with reduced performance.
- [Eric] We touched on that when we discussed autodial up in the event of a
fault: Let the redundancy kick in, but contact base to report the fault.
There's more room for discussion.
- [Brad] When you get into network filters, with ganging of FPGAs and DSPs to
boost performance, you can lose one but still operate, just maybe not so well.
- [Ian] We can have multiple identical processing elements that can hand off
tasks in the event of failure or reduce functionality.
- [Brad] So can BScan help in these applications?
- [Carl] We even have nested hierarchies which provide similar features.
- [Brian] Do we need to address false failures? What about reconfiguring of an
FPGA after a corruption?
- [Brad] Maybe that's outside of our scope - a Diagnostics Manager.
- [Brian] OK, I just wanted to see what we thought was in the scope of our Use
Cases.
- [Brad] Ultimate invocation may be in one of our Use Cases, but the decision
taking may be a system issue.
- [Ian] OK, I think we'll continue this discussion next week.
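Illustrative sketch only, not presented at the meeting: a minimal Python
example of three points raised above, namely reporting which location failed
in which vector, keeping an upper bound on stored diagnostic data, and a
'stop on first fail' flag. All names (run_vectors, cell_map, max_records,
stop_on_first_fail) and the bit-to-cell mapping are assumptions made for
illustration; they are not part of SVF, STAPL, or any vendor tool.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Failure:
    vector_index: int   # which vector failed
    bit_position: int   # position of the failing bit in the scan vector
    expected: str
    actual: str
    location: str       # hypothetical cell/pin/net label, or "unmapped"


@dataclass
class DiagnosticLog:
    max_records: int                      # upper bound on stored failures
    failures: List[Failure] = field(default_factory=list)
    truncated: bool = False               # True if failures were dropped
    stopped_early: bool = False           # True if stop-on-first-fail fired

    def add(self, failure: Failure) -> None:
        if len(self.failures) >= self.max_records:
            self.truncated = True
            return
        self.failures.append(failure)


def run_vectors(vectors: List[Tuple[str, str]],
                cell_map: Dict[int, str],
                max_records: int = 8,
                stop_on_first_fail: bool = False) -> DiagnosticLog:
    """Compare (expected, actual) bit strings; 'X' in expected is don't-care."""
    log = DiagnosticLog(max_records=max_records)
    for index, (expected, actual) in enumerate(vectors):
        failed = False
        for bit, (exp, act) in enumerate(zip(expected, actual)):
            if exp != "X" and exp != act:
                failed = True
                location = cell_map.get(bit, "unmapped")
                log.add(Failure(index, bit, exp, act, location))
        if failed and stop_on_first_fail:
            log.stopped_early = True
            break
    return log


# Hypothetical mapping from scan bit position to cell / pin / net.
CELL_MAP = {0: "U1.cell0 / pin A1 / net CLK",
            2: "U1.cell2 / pin B3 / net DATA0"}
# (expected, actual) pairs; the first vector passes, the other two fail.
VECTORS = [("10X1", "1001"), ("0101", "1111"), ("1100", "0100")]

bounded = run_vectors(VECTORS, CELL_MAP, max_records=2)
halted = run_vectors(VECTORS, CELL_MAP, stop_on_first_fail=True)
for f in bounded.failures:
    print(f"vector {f.vector_index}, bit {f.bit_position}: {f.location}")
```

In this sketch the bounded run keeps only the first two failures and sets
the truncated flag, which is where an alternate strategy would have to take
over; the halted run stops after the first failing vector and never applies
the third.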
- April Newsletter
- [Ian] I've circulated a draft Newsletter. It's not due out until next week,
so I'm not proposing that we approve this yet.
- [Ian] I'd ask that you consider during the course of this week if there are
any other items we ought to add in.
5. Schedule next meeting
Schedule for April 2009:
Monday Apr. 27, 2009, 10:30 AM EDT
Schedule for May 2009:
Monday May 4, 2009, 10:30 AM EDT
Monday May 11, 2009, 10:30 AM EDT
Monday May 18, 2009, 10:30 AM EDT
6. Any other business
None
7. Review new action items
None
8. Adjourn
Moved to adjourn at 11:38 AM EDT by Eric, seconded by Carl.
Thanks to Heiko for supplying additional notes.
Respectfully submitted,
Ian McIntosh